Vanilla RNN

Recurrence as a Memory Mechanism

A feedforward network has no notion of “before”. Each input is mapped to an output through a fixed stack of layers, and no internal state survives between inputs. For sequential data (i.e., language, audio, time series) this is a real limitation: the meaning of the current input almost always depends on what came before it.

Important

Recurrence is the oldest, and conceptually simplest, way to inject memory into a neural network. The network acquires an internal state that persists across steps, so that processing the input at time $t$ is allowed to depend on everything the network has seen up to time $t - 1$ .

The idea is structural. Take any standard layer (fully connected, convolutional, or whatever) and turn it into a recurrent layer by adding one rule: at each time step $t$ , the layer produces a vector $h_{t}$ , and that same vector is fed back into the layer at step $t + 1$ , alongside the next input. The output of one step becomes part of the input of the next.

The vector $h_{t}$ is called the hidden state since:

the layer producing it is typically not the final layer of the network, so $h_{t}$ is internal (hidden) in the same sense as the activations of an MLP hidden layer,
$h_{t}$ is defined recursively: it depends on $h_{t - 1}$ , which depends on $h_{t - 2}$ , and so on all the way back to $h_{0}$ . The state at time $t$ is not just an output; it is a compressed summary of the entire past.

What used to be a forward pass, input in, output out, is now a recursion that unfolds over time, in principle without bound. John Hopfield was among the first to recognize that this kind of feedback was a viable memory mechanism for neural networks.

Why unrolling matters

The compact loop on the left of the figure captures the memory mechanism well, but it is the wrong picture for training. The right side of the figure shows the same network unrolled through time: the same recurrent cell appears once per time step, and the loop becomes a straight chain.

When training on segments of length $50$ , the cell is applied $50$ times. Starting from $x_{1}$ , the cell produces $h_{1}$ using the convention $h_{0} = 0$ , justified by the absence of any prior memory before the first step (see BPTT for the analogy with zero-padding in CNNs). Feeding $x_{2}$ together with $h_{1}$ produces $h_{2}$ . Then $x_{3}$ and $h_{2}$ produce $h_{3}$ , and so on. The unrolled graph is the object that backpropagation actually differentiates through.

Cell vs layer

The two terms name different things, and the distinction is worth fixing now.

A cell is the function of a single time step: it maps the current input and the previous state, $(x_{t}, h_{t - 1})$ , to the new state $h_{t}$ . It is the building block.

A layer is that cell together with the recurrence that applies it along the whole sequence.

A recurrent layer is therefore not a stack of $T$ distinct cells. It is one cell, with one set of parameters, applied $T$ times. The unrolled diagram draws the cell $T$ times only to expose the temporal computation; every copy shares the same weights. The structural consequence of this reuse is the subject of the next callout.

The same split is reflected in PyTorch: nn.RNNCell is the cell (one step, called inside a loop), while nn.RNN is the layer (the cell already wrapped in the loop over the sequence).

One more boundary is worth drawing: the output head is not part of the recurrent layer. The recurrent layer maps the input sequence to the hidden-state sequence $(h_{1}, \dots, h_{T})$ and owns only $W_{x h}, W_{hh}, b_{h}$ . The output head $\hat{y}_{t} = ϕ_{o} (W_{y h} h_{t} + b_{y})$ is a separate read-out layer, with its own parameters $W_{y h}, b_{y}$ , applied on top of each hidden state; in PyTorch it is the nn.Linear placed after nn.RNN. Both the recurrent layer and the output head are shared across time, but only the recurrence belongs to the recurrent layer.

Parameter Sharing and Depth

Because the unrolled architecture reuses one cell at every step, it is functionally equivalent to a very deep MLP in which every layer shares the same weights. This is the principle of parameter sharing.

Depth and parameter sharing are the two structural ingredients usually associated with the term deep learning, and a recurrent network has both.

Historical perspective

The recursive view was already understood as a form of deep learning in the earliest days of recurrent networks. An RNN unrolled over $50$ steps is, computationally, a $50$ -layer network. Various efforts on recurrent connectionist models since the early 1980s culminated in Elman’s RNN layer (1990), which is the cell formalized in the next section. The community working on RNNs in the 1980s would later argue, with some justification, that they had been doing deep learning before the CNN era made the term fashionable.

This chain-like structure is exactly what makes RNNs a natural model for sequences and lists: their computation is organized around ordered data and a state that is carried forward. The same cell can be applied to sequences of any length while preserving temporal order, which is why RNNs accommodate sequence-to-sequence, sequence-to-label, and label-to-sequence problems uniformly. The taxonomy of these mappings is treated in depth in Taxonomy of Sequential Problems.

Vanilla RNN Cell

The cell described in this section is the one formalized by Jeffrey L. Elman in $1990$ , which crystallized roughly a decade of earlier work on recurrent connectionist models into the form still used today.

A vanilla RNN cell takes the current input $x_{t}$ and the previous hidden state $h_{t - 1}$ , combines them through two affine transformations and a single bias term, and passes the result through a nonlinearity.

The index convention adopted throughout these notes is

t = 1, \dots, T, h_{0} = 0 .

At the most abstract level, the state transition is just a function

h_{t} = f_{h} (x_{t}, h_{t - 1}) .

For a vanilla RNN, this function is instantiated as

a_{t} h_{t} = W_{x h} x_{t} + W_{hh} h_{t - 1} + b_{h}, = ϕ_{h} (a_{t}) .

The vector $a_{t}$ is the pre-activation of the recurrent cell, and $ϕ_{h} (\cdot)$ is an element-wise nonlinearity applied to it. In a vanilla RNN the canonical choice is

ϕ_{h} (\cdot) = tanh (\cdot) .

The two affine projections can equivalently be written as a single matrix-vector product on the concatenation of input and previous state,

a_{t} = [W_{x h} ∣ W_{hh}] [x_{t} h_{t - 1}] + b_{h},

where $[W_{x h} ∣ W_{hh}]$ denotes column-wise block concatenation of shape $n_{neurons} \times (n_{inputs} + n_{neurons})$ . This form is what fused GPU kernels actually compute: a single matrix multiplication instead of two. The mini-batch matrix version of the same identity is derived in BPTT.

The model output is then read off the hidden state:

o_{t} \hat{y}_{t} = W_{y h} h_{t} + b_{y}, = ϕ_{o} (o_{t}) .

The matrix $W_{y h}$ maps the hidden state to the logits $o_{t}$ , not to probabilities. In a classification setting, $ϕ_{o} (\cdot)$ is typically the softmax, so $\hat{y}_{t}$ becomes a probability vector over classes.

Four integers govern all the shapes that follow:

$n_{inputs}$ is the size of the input vector at each time step (the same feature dimension written $d$ in Taxonomy of Sequential Problems; here it is renamed to match PyTorch’s input_size),
$n_{neurons}$ is the hidden size of the recurrent layer, i.e., the dimension of the hidden state $h_{t}$ . This is the same quantity that PyTorch calls hidden_size in torch.nn.RNN.
$C$ is the number of output classes.
$T$ is the length of the sequence, indexed as $t = 1, \dots, T$ .

Symbol	Dimension	Description
$x_{t}$	$R^{n_{inputs} \times 1}$	Input vector at time $t$ .
$h_{t - 1}$ , $h_{t}$	$R^{n_{neurons} \times 1}$	Previous and current hidden states.
$a_{t}$	$R^{n_{neurons} \times 1}$	Recurrent pre-activation.
$o_{t}$	$R^{C \times 1}$	Logit vector.
$\hat{y}_{t}$	$R^{C \times 1}$	Predicted output vector.
$W_{x h}$	$R^{n_{neurons} \times n_{inputs}}$	Input-to-hidden weight matrix.
$W_{hh}$	$R^{n_{neurons} \times n_{neurons}}$	Hidden-to-hidden recurrent weight matrix.
$W_{y h}$	$R^{C \times n_{neurons}}$	Hidden-to-output/logit weight matrix.
$b_{h}$	$R^{n_{neurons} \times 1}$	Hidden-state bias vector.
$b_{y}$	$R^{C \times 1}$	Output/logit bias vector.

Hidden State vs Cell State

In vanilla RNNs the recurrent memory variable is uniformly called the hidden state $h_{t}$ .

The term cell state is reserved for LSTMs, where the cell state $c_{t}$ is a separate quantity from the hidden state $h_{t}$ and plays a distinct role in the dynamics.

The unrolled graph makes the full sequence of computations explicit:

x_{1}, x_{2}, \dots, x_{T} ⟼ h_{1}, h_{2}, \dots, h_{T} ⟼ \hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{T} .

Because the same transition is applied at every step, the hidden state at time $t$ can be expanded by repeated substitution:

h_{t} = f_{h} (x_{t}, f_{h} (x_{t - 1}, \dots f_{h} (x_{1}, h_{0}))) .

Naming the intermediate hidden states makes the recursion easier to visualize:

So $h_{t}$ is a compressed summary of the input prefix up to time $t$ , and the output is a function of the inputs seen so far,

\hat{y}_{t} = g_{t} (x_{1}, x_{2}, \dots, x_{t}) .

This statement is purely about the forward computation. The training procedure obtained by differentiating through the unrolled graph is the subject of Backpropagation Through Time, and the stability problems caused by long products of recurrent Jacobians are analyzed in BPTT Problems.

A Minimal PyTorch Implementation

The cell defined above maps to a few lines of PyTorch in a one-to-one fashion. Each line of the forward method corresponds to one equation of the cell, with identical naming. The weight matrices $W_{x h}$ , $W_{hh}$ , $W_{y h}$ have the dimensions declared in the table above and in BPTT.

import torch
import torch.nn as nn
 
class VanillaRNNCell(nn.Module):
    """One Elman step:
           a_t = W_xh x_t + W_hh h_{t-1} + b_h
           h_t = tanh(a_t)
           o_t = W_yh h_t + b_y
    """
    def __init__(self, n_inputs: int, n_neurons: int, n_classes: int):
        super().__init__()
        # nn.Linear(in, out) stores its weight tensor with shape (out, in),
        # which matches the matrix orientation used in this note and in BPTT:
        #   W_xh in R^{n_neurons x n_inputs}
        #   W_hh in R^{n_neurons x n_neurons}
        #   W_yh in R^{n_classes x n_neurons}
        # Bias is disabled on W_xh because the single b_h of the math is
        # carried by W_hh; this is purely an implementation choice.
        self.W_xh = nn.Linear(n_inputs,  n_neurons, bias=False)
        self.W_hh = nn.Linear(n_neurons, n_neurons, bias=True)   # provides b_h
        self.W_yh = nn.Linear(n_neurons, n_classes, bias=True)   # provides b_y
 
    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor):
        """
        x_t    : (batch, n_inputs)
        h_prev : (batch, n_neurons)
        returns:
            h_t : (batch, n_neurons)
            o_t : (batch, n_classes)
        """
        a_t = self.W_xh(x_t) + self.W_hh(h_prev)   # pre-activation
        h_t = torch.tanh(a_t)                      # hidden state
        o_t = self.W_yh(h_t)                       # logits
        return h_t, o_t

Why nn.Linear?

Each weight matrix in the cell ( $W_{x h}$ , $W_{hh}$ , $W_{y h}$ ) paired with its bias defines an affine transformation $y = W x + b$ . PyTorch’s nn.Linear is the canonical implementation of exactly this operation, with the parameter tensors registered for autograd and standard batching support. Using nn.Linear therefore makes the code a literal translation of the math.

The cell computes one step of the recurrence. The driver below unrolls it over the $T$ time steps $t = 1, \dots, T$ , starting from the conventional initial state $h_{0} = 0$ . The runtime variable seq_len equals $T$ . Python arrays are zero-indexed, so x_seq[0] corresponds to $x_{1}$ in the math notation; the loop index i is therefore offset by one from the math index $t$ .

def unroll(cell: VanillaRNNCell, x_seq: torch.Tensor):
    """
    x_seq : (seq_len, batch, n_inputs)
            x_seq[i] corresponds to x_{i+1} in the 1-indexed math notation
    returns:
        H : (seq_len, batch, n_neurons)   hidden states  h_1, ..., h_T
        O : (seq_len, batch, n_classes)   logit vectors  o_1, ..., o_T
    """
    seq_len, B, _ = x_seq.shape
    h_t = torch.zeros(B, cell.W_hh.out_features)   # h_0 = 0
    H, O = [], []
    for i in range(seq_len):
        h_t, o_t = cell(x_seq[i], h_t)
        H.append(h_t); O.append(o_t)
    return torch.stack(H), torch.stack(O)

Note

Three observations are worth making explicit.

the loop for i in range(seq_len) is the temporal unrolling of the previous section made executable: at every iteration, the same cell instance is invoked, and therefore the same parameters $W_{x h}$ , $W_{hh}$ , $W_{y h}$ , $b_{h}$ , $b_{y}$ are reused. The principle of parameter sharing is not stated in the code; it is enforced by construction, because the cell is a single object.

PyTorch’s autograd records the entire chain of operations produced by this loop into one computational graph. Calling .backward() on a loss computed from O is exactly what Backpropagation Through Time performs: the gradients flow backward through every time step, accumulating into the shared parameters along the way.

the cell operates on a mini-batch of sequences by construction, with no extra code: nn.Linear accepts any leading batch dimension, so each forward step processes $m$ examples in parallel, where $m$ denotes the mini-batch size. This exploits cuBLAS-level matrix multiplies on the GPU. The same batching also reduces the variance of the stochastic gradient by a factor of $m$ , which is the reason mini-batch SGD is the default training regime; this point is developed in detail in Mini-Batch SGD. The matrix-form recurrence on the batch and the transposed PyTorch convention input @ W.T are derived in BPTT.

Use the built-in versions in practice

The nn.RNNCell and nn.RNN introduced in the Cell vs layer callout above are the production equivalents of this hand-written code, backed by optimized CUDA kernels, and should be preferred for performance. The implementation here is purely pedagogical: its purpose is to make the correspondence between equations and code transparent, not to be fast.

PyTorch's two biases: bias_ih and bias_hh

The cell above carries a single bias $b_{h}$ , following the standard mathematical convention used in this note and in most textbooks. PyTorch’s built-in nn.RNNCell and nn.RNN adopt a different convention: they expose two bias vectors, bias_ih and bias_hh, and compute the recurrent update as
$h_{t} = tanh (W_{x h} x_{t} + b_{ih} + W_{hh} h_{t - 1} + b_{hh}) .$
The two parameterizations are functionally equivalent up to the identification
$b_{h} = b_{ih} + b_{hh},$
because two bias vectors added to the same linear combination collapse into one. The split is a pure implementation detail, motivated by the symmetry of the fused CUDA kernels: the same layout is reused for LSTM and GRU, where each gate carries its own (W_ih, b_ih) + (W_hh, b_hh) pair. PyTorch’s parameterization therefore has $n_{neurons}$ redundant scalars per layer, since any trained (b_ih, b_hh) can be rewritten as (b_ih + b_hh, 0) without changing the function computed by the cell.

Why tanh?

In a vanilla RNN the hidden state is not merely the output of a layer. It is the quantity that will be fed back into the same layer at the next step:

h_{t} = tanh (a_{t}), a_{t} = W_{x h} x_{t} + W_{hh} h_{t - 1} + b_{h} .

The choice of nonlinearity is therefore not a cosmetic detail. It shapes the dynamics of the memory itself. A good hidden activation for a simple recurrent cell should satisfy three requirements:

let the state carry signed information,
keep the state numerically bounded when reused across many steps,
preserve enough local sensitivity for small updates to propagate.

Each of these requirements, taken on its own, is enough to disqualify common alternatives. The hyperbolic tangent satisfies all three simultaneously.

1) A Signed, Bounded State

The range of $tanh$ is

tanh (x) \in [- 1, 1] .

Each hidden coordinate can therefore encode both positive and negative evidence: values near $1$ encode strong evidence in one direction, values near $- 1$ strong evidence in the opposite direction, and values near $0$ a weak or neutral signal.

A sigmoid hidden activation, by contrast, is constrained to $[0, 1]$ . It can represent degrees of presence but is not centered around zero. In a recurrent cell this matters: the affine term $W_{hh} h_{t - 1}$ then receives an input with a persistent positive offset, and contributions cannot cancel out cleanly. A zero-centered state is, in practice, much easier to combine linearly, since positive and negative coordinates can reinforce, oppose, or cancel each other.

The bounded range plays a second, more dynamical role. Since $h_{t}$ is reinjected into the same layer at the next step, an unbounded activation would let the state drift to arbitrarily large magnitudes after enough iterations, with no intrinsic scale to anchor it. The $tanh$ acts as a smooth soft-clipping function: moderate signals pass through almost unchanged, while no coordinate is ever allowed outside $[- 1, 1]$ .

2) A Nearly Linear Region Around Zero

Near the origin, $tanh$ behaves almost like the identity:

tanh (z) \approx z for small z .

This is the desired operating regime for a healthy hidden state. Small changes in the pre-activation $a_{t}$ produce nearly proportional changes in $h_{t}$ : the cell can update its memory smoothly, rather than snapping to a saturated value at the first significant input.

The derivative makes this precise:

\frac{d}{d z} tanh (z) = 1 - tanh^{2} (z), tanh^{'} (0) = 1.

So, near zero, the activation does not locally shrink gradients. This does not solve the long-range stability problem of RNNs, which arises from repeated products of recurrent Jacobians during BPTT and is analyzed in BPTT Problems. But at the level of a single cell, $tanh$ offers strictly stronger local sensitivity than the sigmoid.

3) Comparison With Sigmoid

The sigmoid derivative is

σ^{'} (z) = σ (z) (1 - σ (z)),

with maximum value

z max σ^{'} (z) = \frac{1}{4} .

Even at its most sensitive point, the sigmoid locally scales gradients by at most $0.25$ . The hyperbolic tangent attains a maximum derivative of $1$ . This factor-of-four gap is one of the main reasons $tanh$ became the default hidden activation in vanilla RNNs.

The two activations are not independent functions; they are related by

tanh (x) = 2 σ (2 x) - 1,

so $tanh$ is a rescaled and shifted sigmoid. The shift is the entire point: it carries the activation from the positive interval $[0, 1]$ to the zero-centered interval $[- 1, 1]$ .

Derivation

$σ (z) = \frac{1}{1 + e ^{- z}}$

$tanh (x) = \frac{e ^{x} - e ^{- x}}{e ^{x} + e ^{- x}} = \frac{e ^{2 x} - 1}{e ^{2 x} + 1}$

$σ (2 x) = \frac{1}{1 + e ^{- 2 x}} = \frac{e ^{2 x}}{e ^{2 x} + 1}$

$2 σ (2 x) - 1 = \frac{e ^{2 x} - 1}{e ^{2 x} + 1}$

Therefore,
$tanh (x) = 2 σ (2 x) - 1.$

4) Saturation: Useful and Dangerous

The same boundedness that keeps the hidden state under control also introduces a failure mode. For large positive or negative pre-activations, $tanh$ saturates:

tanh (z) \to 1 or tanh (z) \to - 1,

and the derivative collapses,

tanh^{'} (z) = 1 - tanh^{2} (z) \approx 0.

Intuitively, once a hidden coordinate is pinned near $\pm 1$ , changing the input a little barely moves the output. The unit has become confident, and therefore insensitive. As a single-step phenomenon this is benign, even desirable: it is what gives the soft-clipping behavior. Chained over many time steps, however, saturation contributes directly to the vanishing-gradient dynamics of vanilla RNNs.

The opposite failure mode, exploding gradients, is not prevented by $tanh$ either. Bounded activations do not imply bounded gradients: the gradient of the loss with respect to the recurrent parameters depends on the product of recurrent Jacobians $\prod_{t} diag (tanh^{'} (a_{t})) W_{hh}$ , which can grow exponentially in $t$ whenever $∥ W_{hh} ∥ > 1$ , regardless of how saturated the individual units are at each step. The full analysis of both failure modes is in BPTT Problems, and the standard optimizer-level remedy is in Gradient Clipping.

This is the trade-off in a single sentence: $tanh$ gives a signed, bounded memory state with strong local sensitivity around zero, at the price of attenuating gradients in the saturated regime while doing nothing to prevent the recurrent matrix from amplifying them in the linear regime. It is the reason vanilla RNNs are conceptually elegant yet difficult to train on long-range dependencies.

Why not *eLU?

Exponential Linear Units and their variants are useful in modern feedforward and convolutional architectures, but they are not natural choices for a vanilla RNN hidden state.

The reason is structural rather than empirical. *eLU is unbounded above and asymmetric around zero, so it satisfies neither requirement (1) (signed) nor the dynamical part of requirement (2) (bounded state under repeated reinjection). For a quantity that will be fed back into itself indefinitely, $tanh$ provides a clearer and more controlled representation.

The same reasoning explains why $tanh$ reappears inside the LSTM: the cell’s candidate update is required to live in $[- 1, 1]$ by design, and $tanh$ matches that requirement exactly, whereas *eLU does not.

Deep Learning: Zero to Hero

Explorer

Recurrence as a Memory Mechanism

Vanilla RNN Cell

A Minimal PyTorch Implementation

Why tanh?

1) A Signed, Bounded State

2) A Nearly Linear Region Around Zero

3) Comparison With Sigmoid

4) Saturation: Useful and Dangerous

Graph View

Table of Contents

Backlinks