Every recurrent network seen so far reads its input strictly left to right. At time $t$, the hidden state $h_t$ summarises the prefix $x_1, \dots, x_t$, and nothing in the architecture gives it access to the suffix $x_{t+1}, \dots, x_T$. For tasks where decisions at position $t$ genuinely depend on what comes next, this is a hard limitation, not a matter of capacity.

A bidirectional RNN lifts the restriction in the simplest possible way: run two recurrent networks over the same input sequence, one forward in time and one backward, then concatenate their hidden states at every position. Schuster and Paliwal introduced the idea in 1997, the same year as the LSTM. The two ideas compose: a bidirectional LSTM (BiLSTM) is the standard combination.

Definition

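Given an input sequence $x_1, \dots, x_T$, a bidirectional RNN maintains two independent recurrences with separate parameters:

$$\overrightarrow{h}_t = \overrightarrow{f}\big(x_t, \overrightarrow{h}_{t-1}; \overrightarrow{\theta}\big), \qquad t = 1, \dots, T,$$
$$\overleftarrow{h}_t = \overleftarrow{f}\big(x_t, \overleftarrow{h}_{t+1}; \overleftarrow{\theta}\big), \qquad t = T, \dots, 1.$$

The forward recurrence reads the sequence in the natural order; the backward recurrence reads it in reverse. The two cells $\overrightarrow{f}$ and $\overleftarrow{f}$ have the same shape (vanilla, LSTM or GRU) but distinct parameters $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$, so they specialise independently.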

At each position $t$, the bidirectional hidden state is the concatenation of the two:

$$h_t = \big[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\big].$$

The downstream prediction head at position $t$ sees both halves, and therefore has access to the entire sequence (past and future) through this single vector.
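As a concrete illustration, the definition can be sketched in a few lines of NumPy for a vanilla tanh cell. The function names (`rnn_step`, `init_params`, `bidirectional_rnn`), the sizes `d` (input width) and `H` (hidden width per direction), and the random initialisation are assumptions made for this example only; any recurrent cell could stand in for `rnn_step`.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla (tanh) RNN step; a stand-in for any recurrent cell."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def init_params(d, H, rng):
    """Hypothetical initialiser: small random weights, zero bias."""
    return (0.1 * rng.standard_normal((H, d)),
            0.1 * rng.standard_normal((H, H)),
            np.zeros(H))

def bidirectional_rnn(xs, fwd_params, bwd_params):
    """xs: (T, d) array. Returns a (T, 2H) array of concatenated states."""
    T = xs.shape[0]
    H = fwd_params[2].shape[0]
    h_fwd = np.zeros((T, H))
    h_bwd = np.zeros((T, H))

    h = np.zeros(H)                  # zero state before the first position
    for t in range(T):               # left to right
        h = rnn_step(xs[t], h, *fwd_params)
        h_fwd[t] = h

    h = np.zeros(H)                  # zero state after the last position
    for t in reversed(range(T)):     # right to left
        h = rnn_step(xs[t], h, *bwd_params)
        h_bwd[t] = h

    # Row t holds [forward state at t ; backward state at t].
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, d, H = 5, 3, 4
xs = rng.standard_normal((T, d))
states = bidirectional_rnn(xs, init_params(d, H, rng), init_params(d, H, rng))
print(states.shape)  # (5, 8)
```

Each row of `states` is the vector a prediction head would consume at that position: the first `H` entries come from the forward recurrence, the last `H` from the backward one.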

Dimensions used in this note

Colours follow the figure: the forward pass in green, the backward pass in blue.

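| symbol | role | shape |
|---|---|---|
| $x_t$ | input at position $t$ | $\mathbb{R}^{d}$ |
| $\overrightarrow{h}_t$ | forward hidden state | $\mathbb{R}^{H}$ |
| $\overleftarrow{h}_t$ | backward hidden state | $\mathbb{R}^{H}$ |
| $h_t$ | bidirectional hidden state at position $t$ | $\mathbb{R}^{2H}$ |
| $\overrightarrow{\theta}, \overleftarrow{\theta}$ | parameters of the two cells | depends on cell type |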

For a bidirectional LSTM each direction also carries its own cell state ($\overrightarrow{c}_t$ and $\overleftarrow{c}_t$ respectively), used internally by the corresponding direction’s gates but never exposed to the other direction or to downstream modules.

"Backward" is just "forward" on the reversed sequence

The backward recurrence is not a new kind of cell. It is an ordinary forward recurrence run on the time-reversed input: the sequence is reversed to $x_T, \dots, x_1$, a standard left-to-right cell is run over it, and the resulting states are flipped back to the original order. This is, in effect, how frameworks implement it, and it is why any recurrent cell (vanilla, LSTM or GRU) becomes bidirectional for free: bidirectionality wraps the time axis; it is not a property of the cell. The only thing that distinguishes the backward cell $\overleftarrow{f}$ from the forward cell $\overrightarrow{f}$ is its separate parameters, learned to summarise suffixes rather than prefixes; the arithmetic is identical.
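The equivalence can be checked directly. The sketch below, again for a vanilla tanh cell with made-up sizes and hypothetical function names, writes the right-to-left recurrence out by hand and compares it with "reverse, run forward, flip back" using the same parameters.

```python
import numpy as np

def run_forward(xs, W_x, W_h, b):
    """Ordinary left-to-right tanh RNN over xs of shape (T, d); returns (T, H)."""
    h = np.zeros(b.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        out.append(h)
    return np.stack(out)

def run_backward_explicit(xs, W_x, W_h, b):
    """Right-to-left recurrence written out directly, for comparison."""
    T = xs.shape[0]
    h = np.zeros(b.shape[0])
    out = np.zeros((T, b.shape[0]))
    for t in range(T - 1, -1, -1):
        h = np.tanh(W_x @ xs[t] + W_h @ h + b)
        out[t] = h
    return out

def run_backward_by_flipping(xs, W_x, W_h, b):
    """Reverse the time axis, run the forward routine, flip the states back."""
    return run_forward(xs[::-1], W_x, W_h, b)[::-1]

rng = np.random.default_rng(1)
T, d, H = 6, 3, 4
xs = rng.standard_normal((T, d))
params = (rng.standard_normal((H, d)),
          rng.standard_normal((H, H)),
          rng.standard_normal(H))

explicit = run_backward_explicit(xs, *params)
flipped = run_backward_by_flipping(xs, *params)
print(np.allclose(explicit, flipped))  # True
```

One practical caveat: with padded batches of variable-length sequences, a naive flip of the whole time axis would feed the padding to the backward cell first, so real implementations reverse each sequence within its true length (for example via packed sequences in PyTorch).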

What each half encodes, and why two halves are necessary

The two halves are not redundant; they encode genuinely different information.

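  • The forward state $\overrightarrow{h}_t$ summarises the prefix $x_1, \dots, x_t$: everything the model has seen up to and including position $t$.
  • The backward state $\overleftarrow{h}_t$ summarises the suffix $x_t, \dots, x_T$: everything from position $t$ to the end.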

Their concatenation $h_t$ is therefore a representation of position $t$ in the context of the entire sequence, not just its prefix. Any decision the network needs to make at position $t$, for example a part-of-speech tag, a named-entity label, or an acoustic phoneme classification, can now condition on both directions.
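A small experiment makes the prefix/suffix split tangible: perturb the input at a single position and see which states move. The code is a sketch with a vanilla tanh cell; the helper `run`, the sizes, and the perturbed index `s` are all assumptions of this example.

```python
import numpy as np

def run(xs, W_x, W_h, b, reverse=False):
    """tanh RNN over xs (T, d); reverse=True reads right to left.
    States are returned in the original time order either way."""
    T = len(xs)
    order = range(T - 1, -1, -1) if reverse else range(T)
    h = np.zeros(b.shape[0])
    out = np.zeros((T, b.shape[0]))
    for t in order:
        h = np.tanh(W_x @ xs[t] + W_h @ h + b)
        out[t] = h
    return out

rng = np.random.default_rng(2)
T, d, H, s = 7, 3, 4, 4          # s: the (0-based) position we perturb
fwd = (rng.standard_normal((H, d)), rng.standard_normal((H, H)), np.zeros(H))
bwd = (rng.standard_normal((H, d)), rng.standard_normal((H, H)), np.zeros(H))

xs = rng.standard_normal((T, d))
xs_changed = xs.copy()
xs_changed[s] += 1.0             # change the input at position s only

f0, f1 = run(xs, *fwd), run(xs_changed, *fwd)
b0, b1 = run(xs, *bwd, reverse=True), run(xs_changed, *bwd, reverse=True)

# Forward states strictly before s never saw the change.
fwd_prefix_unchanged = np.array_equal(f0[:s], f1[:s])
# Backward states strictly after s never saw it either.
bwd_suffix_unchanged = np.array_equal(b0[s + 1:], b1[s + 1:])
print(fwd_prefix_unchanged, bwd_suffix_unchanged)  # True True

# Positions whose concatenated state differs between the two runs.
h0 = np.concatenate([f0, b0], axis=1)
h1 = np.concatenate([f1, b1], axis=1)
changed = np.any(h0 != h1, axis=1)
print(changed)
```

Neither half alone reacts to the change everywhere: the forward half can only move at positions from `s` onward, the backward half only at positions up to `s`. It is the concatenation that lets every position register an edit anywhere in the sequence.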

When a single direction is structurally insufficient

Consider a part-of-speech tagger reading the sentence “They saw the saw”. At the second word, “saw”, the correct tag (verb) is disambiguated only by what comes after (the article “the” and the noun “saw”). A unidirectional model would have to commit before having that evidence; a bidirectional model defers the decision until both directions have run, then combines their summaries.

The need is structural: no matter how wide or deep the forward RNN is, it cannot see the future. Bidirectionality is the architectural fix, not a hyperparameter to tune.

When bidirectional models can and cannot be used

The benefit of bidirectionality is bought at a precise cost: the entire input sequence must be available before any output is produced, because the backward pass cannot start until the last position is known. This rules out bidirectional models for two important classes of tasks.

  • Online or streaming inference. A speech recogniser that must transcribe as the audio arrives cannot wait for the end of the utterance. A trading system, a real-time captioner, an interactive agent: all need their output at time $t$ to depend only on inputs through time $t$. Unidirectional models are the only option here.
  • Autoregressive generation. A language model that generates the next word from the previous words cannot use a backward pass over the future, because the future does not exist yet. The whole point of the model is to produce it. Decoder-side recurrences in sequence-to-sequence models are always unidirectional for this reason.

Conversely, bidirectional models are the natural choice whenever the full input is in hand before any output is needed: tagging tasks (POS, NER, chunking), sentence-level classification, the encoder side of translation, offline acoustic modelling, and the encoder side of most retrieval-augmented architectures.

Modern attention-based models and bidirectionality

The encoder of a Transformer (BERT, RoBERTa, encoder-only models in general) is bidirectional by construction: self-attention at position $t$ attends to all positions $1, \dots, T$, future included, with no recurrent ordering. The architectural distinction that mattered for RNNs (forward vs backward recurrence, concatenated states) dissolves when the model has no recurrence to begin with: bidirectionality becomes a property of the attention mask, not of the cell. Decoder-side or autoregressive Transformers (GPT-family models) impose a causal mask precisely to recover the unidirectional constraint that generation requires.
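To see that the mask is the only difference, here is a deliberately stripped-down sketch of one attention head. It is an assumption-laden toy, not a Transformer layer: queries and keys are taken to be the raw inputs, with no learned projections, and the function name `self_attention_weights` is made up for the example.

```python
import numpy as np

def self_attention_weights(X, causal):
    """X: (T, d). Returns (T, T) attention weights for one unparameterised head
    (queries = keys = X, an assumption that keeps the sketch short)."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    if causal:
        # True where the key position lies strictly after the query position.
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 8))
bi = self_attention_weights(X, causal=False)   # encoder-style: sees everything
uni = self_attention_weights(X, causal=True)   # decoder-style: past and present only

print(np.all(bi > 0))                    # True
print(np.all(np.triu(uni, k=1) == 0))    # True
```

Without the mask, every row places non-zero weight on every position, future included. With the causal mask, row `t` places zero weight on every later position, which is the attention analogue of running only the forward recurrence.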

Cost of bidirectionality

The two directions are independent forward passes, so the additional cost is straightforward.

  • Parameters double. Two recurrent cells with independent weights, instead of one. For a BiLSTM of width $H$ per direction on $d$-dimensional inputs, the parameter count is $2 \cdot 4H(d + H + 1)$, counting one bias vector per gate.
  • Compute roughly doubles. Each input position is processed by both directions. The two recurrences are independent and can be computed in parallel given enough hardware, so wall-clock time grows by less than $2\times$ when memory and parallelism allow.
  • Hidden-state width per position doubles. Downstream layers that consume $h_t$ must be sized accordingly: a prediction head expecting $2H$ inputs, an attention module attending to a $2H$-dimensional sequence, and so on.

These are all constant factors of $2$. The architectural lift is large compared to the cost, and bidirectional encoders were standard in pre-Transformer NLP and speech for exactly that reason.
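The parameter doubling is easy to verify by counting. The sketch below assumes the textbook LSTM layout (four gate blocks, each with input weights, recurrent weights, and a single bias vector); the function names and the example sizes `d = 128`, `H = 256` are chosen for illustration.

```python
def lstm_params(d, H):
    """Parameters of one LSTM direction: 4 gate blocks, each with input
    weights (H x d), recurrent weights (H x H) and one bias (H).
    Assumes a single bias vector per gate."""
    return 4 * (H * d + H * H + H)

def bilstm_params(d, H):
    """Two independent directions, so exactly twice the single-direction count."""
    return 2 * lstm_params(d, H)

d, H = 128, 256
print(lstm_params(d, H))    # 394240
print(bilstm_params(d, H))  # 788480
print(2 * H)                # 512: input width a downstream head must accept
```

Library implementations may count slightly differently. PyTorch's `nn.LSTM`, for instance, keeps separate input and recurrent bias vectors, which adds another `4 * H` parameters per direction; the factor of two between unidirectional and bidirectional is unaffected.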

Composing with stacking: deep bidirectional RNNs

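Bidirectionality and stacking are orthogonal. A deep bidirectional RNN is built by replacing each layer of a stacked RNN with a bidirectional version of the same cell. At each layer $\ell$, the input is the previous layer’s concatenated output $h_t^{(\ell-1)}$, processed by two independent recurrences whose outputs are again concatenated:

$$\overrightarrow{h}_t^{(\ell)} = \overrightarrow{f}^{(\ell)}\big(h_t^{(\ell-1)}, \overrightarrow{h}_{t-1}^{(\ell)}\big), \qquad \overleftarrow{h}_t^{(\ell)} = \overleftarrow{f}^{(\ell)}\big(h_t^{(\ell-1)}, \overleftarrow{h}_{t+1}^{(\ell)}\big), \qquad h_t^{(\ell)} = \big[\overrightarrow{h}_t^{(\ell)} \,;\, \overleftarrow{h}_t^{(\ell)}\big],$$

with $h_t^{(0)} = x_t$.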

Within a layer, the forward and backward recurrences are independent of each other: no information passes from one direction to the other. The two directions interact only when their concatenated outputs are fed jointly into the next layer.
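The stacking rule can be sketched as a loop over layers, each consuming the concatenated output of the one below. As before this uses a vanilla tanh cell, and the helpers `run`, `make_params`, `deep_birnn` and all sizes are assumptions of the example. The detail worth noticing is the input width: the first layer sees `d` features, every later layer sees `2 * H`.

```python
import numpy as np

def run(xs, W_x, W_h, b, reverse=False):
    """tanh RNN over xs (T, in_dim); reverse=True reads right to left.
    States are returned in the original time order either way."""
    T = len(xs)
    order = range(T - 1, -1, -1) if reverse else range(T)
    h = np.zeros(b.shape[0])
    out = np.zeros((T, b.shape[0]))
    for t in order:
        h = np.tanh(W_x @ xs[t] + W_h @ h + b)
        out[t] = h
    return out

def make_params(in_dim, H, rng):
    """Hypothetical initialiser: small random weights, zero bias."""
    return (0.1 * rng.standard_normal((H, in_dim)),
            0.1 * rng.standard_normal((H, H)),
            np.zeros(H))

def deep_birnn(xs, layers):
    """layers: list of (fwd_params, bwd_params). Returns the top (T, 2H) states."""
    h = xs                                    # input to the first layer is the raw sequence
    for fwd, bwd in layers:
        # The two directions run independently on the same layer input;
        # they only meet in this concatenation, which feeds the next layer.
        h = np.concatenate([run(h, *fwd), run(h, *bwd, reverse=True)], axis=1)
    return h

rng = np.random.default_rng(4)
T, d, H, num_layers = 6, 3, 5, 3
in_dims = [d] + [2 * H] * (num_layers - 1)    # [3, 10, 10]
layers = [(make_params(n, H, rng), make_params(n, H, rng)) for n in in_dims]

out = deep_birnn(rng.standard_normal((T, d)), layers)
print(out.shape)  # (6, 10)
```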

The pre-Transformer NLP encoder

A stack of 2 to 4 bidirectional LSTM layers was the standard encoder for sequence-tagging and classification tasks in NLP from roughly 2015 to 2018; ELMo (Peters et al., 2018) is the best-known instance of the design. The encoder of a Transformer replaced this stack with a stack of self-attention blocks, gaining parallelism along the time axis (no sequential dependency) and direct long-range routing (no recurrent bottleneck), while keeping the bidirectional property as a property of the attention mask. Reading “BiLSTM encoder” in older papers and “Transformer encoder” in newer ones is essentially reading two different implementations of the same architectural intent: a contextual representation of every position that depends on the entire sequence.

A note on the backward initial state

The forward recurrence starts from $\overrightarrow{h}_0 = 0$, by the convention used throughout these notes. The backward recurrence starts from $\overleftarrow{h}_{T+1} = 0$, the analogous zero state on the right end of the sequence. Both initial states can be made learnable parameters in implementations that benefit from a non-zero starting representation, but the convention above is sufficient for almost all uses and is what frameworks like PyTorch use by default.

The construction completes the recurrent toolkit: cells (vanilla, LSTM, GRU), depth, and now direction. The next note assembles that toolkit into the architecture that carried recurrent networks into production, the encoder-decoder, whose fixed-width bottleneck is in turn the doorway to attention, the subject of the next section.