CNNs and MLPs limitations

Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) are not limited because they are weak function approximators. In principle, sufficiently large feed-forward networks can approximate very complex input-output maps.

Their limitation is architectural. Standard MLPs and CNNs are naturally designed around fixed-size input representations. Sequential data, by contrast, are ordered collections whose length may vary from example to example, and whose meaning often depends on temporal structure. The problem is therefore not expressivity in the abstract, but the mismatch between the architecture and the structure of the data.

Sequential Data as Variable-Length Objects

A sequence can be written as

x_{1 : T} = (x_{1}, x_{2}, \dots, x_{T}),

where each element $x_{t} \in R^{d}$ represents the observation at position or time step $t$ , and $T$ denotes the sequence length.

Example: a trajectory as a sequence

Consider the trajectory of a moving point observed at three consecutive time steps. If each observation records the horizontal and vertical position of the point, the sequence can be written as
$x_{1 : 3} = (x_{1}, x_{2}, x_{3}),$
where, for example,
$x_{1} = (2.0, 1.0), x_{2} = (2.5, 1.4), x_{3} = (3.1, 1.9) .$
In this case, each $x_{t} \in R^{2}$ contains two coordinates, so $d = 2$ , and the sequence length is $T = 3$ .

The crucial point is that $T$ is not generally fixed. A sentence may contain five words or fifty words. An audio clip may last one second or ten seconds. A trajectory may contain a short or long history of observed positions. A video may contain a variable number of frames.

Therefore, a sequence model should ideally define a computation that remains meaningful for arbitrary lengths:

f_{θ} : T \geq 1 ⋃ (R^{d})^{T} ⟶ Y .

The formula looks more abstract than it is. Read piece by piece:

$(R^{d})^{T}$ is the space of sequences of exactly length $T$ , where each of the $T$ elements is a vector in $R^{d}$ . Concretely, $(R^{d})^{1}$ is just $R^{d}$ , $(R^{d})^{2}$ is the space of pairs of $d$ -dimensional vectors, and so on.
$⋃_{T \geq 1}$ is a union: it collects together the space of length- $1$ sequences, the space of length- $2$ sequences, the space of length- $3$ sequences, and so on, into a single domain.
$f_{θ}$ takes as input any sequence drawn from this union, with the length $T$ allowed to vary from example to example, and produces an output in $Y$ . The output space $Y$ depends on the task: a single label, a probability vector, another sequence, and so on.

The core difficulty is that this input space is not a single vector space of fixed dimension, but a union of spaces of different dimensions. A function defined on $R^{10}$ has no canonical extension to $R^{50}$ : it does not know what to do with the extra $40$ coordinates, or which $10$ of the $50$ to look at. A sequence model needs a different kind of construction, one whose definition does not depend on a fixed input dimension.

Fixed-size data versus variable-length data

Images are often treated as fixed-size tensors because they can be resized or cropped to a common grid such as $224 \times 224 \times 3$ . This does not mean that images are inherently simple; rather, it means that computer-vision pipelines usually force them into a common spatial shape before processing.

Sequential data are different. Their length is part of their structure. Padding or truncating a sequence may be useful in implementation, but it is an external workaround, not a native architectural solution.

Why plain MLPs are not suitable for sequences

An MLP expects an input vector of fixed dimension:

x \in R^{N} .

To feed a sequence into an MLP, one must first force it into a fixed-size vector. A common strategy is to choose a maximum length $T_{m a x}$ , pad shorter sequences, truncate longer ones, and flatten the result:

x_{1 : T} ⟶ \tilde{x} \in R^{T_{m a x} d} .

This procedure creates several problems:

the architecture is tied to the arbitrary choice of $T_{m a x}$ ;
information may be lost when sequences longer than $T_{m a x}$ are truncated;
computation is wasted on padding for shorter sequences;
temporal order is present only through coordinate position, not through a dedicated sequential mechanism.

The MLP can still learn useful patterns if the dataset is simple enough, the maximum length is small enough, and enough data are available. However, the architecture does not make the sequential structure native. It treats the sequence primarily as a large fixed vector.

There is a structural issue that runs deeper than the fixed-size requirement and that survives even if the length is somehow fixed in advance. When a sequence is flattened into a vector $\tilde{x} \in R^{T_{m a x} d}$ , the entry at position $t$ is multiplied by a completely separate set of weights from the entry at position $t^{'} \neq = t$ . The MLP has no built-in mechanism that ties the parameters acting on different positions together.

The consequence is concrete. A pattern that means “verb”, “phoneme /a/”, or “price spike” must be learned independently at every position it can appear in. If a verb appears in position $3$ of a sentence during training and the same verb appears in position $7$ of a test sentence, the MLP treats the two occurrences as unrelated unless the training set happens to contain enough examples of that same verb at position $7$ as well. The model has no architectural prior that says “a recurring phenomenon at different positions should be learned with the same parameters”.

This is the absence of parameter sharing across positions, and it is the architectural prior that the two natural alternatives to the MLP each provide along a different axis:

CNNs share the same kernel across all spatial positions of the input. A filter that detects an edge at pixel $(i, j)$ is the same filter that detects an edge at pixel $(i^{'}, j^{'})$ . Position invariance is built into the architecture.
RNNs share the same cell across all temporal positions of the input. The transformation applied to $(x_{t}, h_{t - 1})$ is the same transformation applied to $(x_{t + 1}, h_{t})$ . Position invariance along the time axis is also built into the architecture.

Both choices embed an invariance the MLP lacks: a recurring phenomenon at different positions is learned with the same parameters. This single architectural absence is the deepest reason why MLPs need orders of magnitude more data than CNNs or RNNs to generalize on sequential tasks: the MLP has to re-learn at every position what the other architectures learn once.

The Statelessness Problem

There is a second, deeper limitation. Standard MLPs and standard feed-forward CNNs are stateless architectures. Given an input, they compute an output through a fixed sequence of layers:

\hat{y} = f_{θ} (x) .

Once this computation is complete, the model does not retain an internal record of what it has processed. If another input arrives later, the network starts a new forward pass from scratch. Any information about previous inputs must be explicitly included again in the current input representation.

No intrinsic memory

A feed-forward MLP or CNN does not naturally remember previous elements of a sequence. It can only use past information if that information has already been packed into the input given to the model.

This is precisely what makes sequence modeling difficult: in many tasks, the meaning of the current element depends on what came before. A model that has no internal mechanism for carrying information forward must rely on external preprocessing, fixed windows, padding, truncation, or hand-designed context representations.

This distinction is essential:

the fixed-size problem concerns the shape of the input.
the statelessness problem concerns the dynamics of the computation. Even if a sequence is forced into a fixed-size tensor, a feed-forward architecture still does not possess a native mechanism for progressively accumulating context as the sequence unfolds.

What CNNs Improve and Where They Still Fall Short

CNNs improve on MLPs by introducing a strong structural prior: local patterns are detected with filters that slide across the input. This idea is not restricted to images. One-dimensional convolutions can be applied to text, audio, time series, and other sequential signals.

For example, given a sequence

(x_{1}, x_{2}, \dots, x_{T}),

a 1D convolution can detect local motifs such as short word patterns, phonetic fragments, local waveform shapes, or short-term temporal trends.

What CNNs handle well

CNNs are useful when the relevant signal is local or compositional: nearby elements form short patterns, short patterns compose into larger patterns, and translation of a pattern along the sequence should not completely change its meaning.

However, standard feed-forward CNNs still have important limitations for general sequence modeling:

they process sequences through a finite receptive field determined by kernel sizes, depth, dilation, and pooling;
long-range interactions require either many layers, large kernels, dilation schemes, or additional global aggregation mechanisms;
input and output alignment must be designed explicitly;
variable-length sequence-to-sequence problems require extra machinery beyond a plain convolutional stack.

Thus, CNNs are not categorically unable to process sequential data. They can be very strong sequence models in many settings. The more precise claim is that standard CNNs are naturally biased toward local pattern extraction, whereas many sequence problems require a more flexible mechanism for accumulating and transforming context over time.

This is the mechanism that recurrent neural networks provide: a hidden state that is updated step by step, and a shared cell that transforms it. The recurrent architecture closes all three gaps identified in this note at once. It handles variable-length input natively (each step is a separate application of the same cell), it carries an explicit internal state across steps (no statelessness), and it shares its parameters across all temporal positions (the same cell is applied at every $t$ ).

Before turning to the architecture itself, the next note, Taxonomy of Sequential Problems, maps out the range of input-output shapes a sequence model must handle. The simplest recurrent architecture is then formalized in Vanilla RNN.

Deep Learning: Zero to Hero

Explorer

CNNs and MLPs limitations

Sequential Data as Variable-Length Objects

Why plain MLPs are not suitable for sequences

The Statelessness Problem

What CNNs Improve and Where They Still Fall Short

Graph View

Table of Contents

Backlinks

Deep Learning: Zero to Hero

Explorer

CNNs and MLPs limitations

Sequential Data as Variable-Length Objects

Why plain MLPs are not suitable for sequences

The Parameter-Sharing Problem

The Statelessness Problem

What CNNs Improve and Where They Still Fall Short

Graph View

Table of Contents

Backlinks