Why computational graphs are indispensable

Backpropagation is often described as if it were a special trick invented specifically for neural networks. That description misses the deeper point.

What backpropagation really exploits is the fact that a neural network is a composite computation: its output is not produced in one step, but through a long sequence of elementary operations. A computational graph is the representation that makes this compositional structure explicit.

Important

A computational graph turns a complicated function into a network of simple local operations. Once that structure is visible, differentiation becomes an organized application of the chain rule rather than an opaque symbolic manipulation.

This perspective is fundamental because it reveals that:

  • neural networks are not exceptional from the viewpoint of differentiation;
  • what matters is not the model name, but the structure of the computation;
  • backpropagation is best understood as reverse-mode automatic differentiation on a computational graph.

From formula to graph

The equivalence between a compact formula and a graph-ready factorization is easiest to see side by side.

| Compact expression | Graph-ready factorization |
| --- | --- |
| $a = \sigma(w_1 x_1 + w_2 x_2 + w_3 x_3 + b)$ | $p_i = w_i x_i$ for $i = 1, 2, 3$; $\;z = p_1 + p_2 + p_3 + b$; $\;a = \sigma(z)$ |

This factorization is precisely what the computational graph records: each named intermediate quantity ($p_i$, $z$, $a$) becomes a node, and each name reappears wherever the same value is reused, making it easy to track the same quantities across the two representations.

The distinction is not cosmetic:

  • the compact formula emphasizes the final expression;
  • the graph emphasizes the dependency structure of the computation.

Differentiation depends on that dependency structure.
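The distinction can also be seen directly in code. The sketch below (a minimal Python version; the variable names $p$, $z$, $a$ anticipate the single-neuron example used later in this note, and the specific numbers are arbitrary) writes the forward computation in graph-ready form, with one named intermediate per node:

```python
import math

def sigmoid(z):
    """Logistic sigmoid, the activation used in this note's example."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(w, x, b):
    """Graph-ready factorization: each intermediate quantity gets its own
    name, mirroring one node of the computational graph."""
    p = [wi * xi for wi, xi in zip(w, x)]   # mult nodes: p_i = w_i * x_i
    z = sum(p) + b                          # add node: the pre-activation
    a = sigmoid(z)                          # sigmoid node: the activation
    return p, z, a

# The compact formula sigmoid(w . x + b) computes the same number,
# but hides the intermediate dependency structure.
p, z, a = neuron_forward([0.5, -1.0, 2.0], [1.0, 2.0, 0.5], 0.1)
```

The compact one-liner and the factored version agree numerically; only the factored version exposes which quantity depends on which.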


What a computational graph is

A computational graph is a directed graph describing how outputs are produced from inputs through elementary operations.

In the convention adopted here:

  • nodes represent operations such as multiplication, addition, or sigmoid;
  • edges represent the flow of numerical values from one operation to the next;
  • leaves typically correspond to external quantities such as inputs, parameters, or constants;
  • the root is the final quantity of interest, which in learning is usually a scalar loss.

Conventions vary

Some texts place variables at nodes and operations on edges. Others, including the diagram below, use the opposite convention. Nothing essential depends on that choice. What matters is that the graph faithfully captures which quantities depend on which others.

For a single forward evaluation, the graph is normally acyclic: every node depends only on values already available upstream.
This acyclic structure allows both computation and differentiation to be organized by topological order.

Reading the figure

The diagram represents a single neuron with three inputs.

  • Each mult node computes a product $p_i = w_i x_i$.
  • The add node computes the pre-activation $z = p_1 + p_2 + p_3 + b$.
  • The sigmoid node computes the activation $a = \sigma(z)$.
  • The red expressions indicate local derivatives provided by individual nodes.

Strictly speaking, training usually appends one more node on top of $a$, namely a scalar loss

$$L = L(a),$$

because the true training objective is not the activation itself but the loss.
However, the graph shown above is already sufficient to understand the key local mechanisms.


The principle of locality

The decisive advantage of a computational graph is that each node only needs local knowledge.

For a node computing

$$y = f(x_1, \dots, x_k),$$

two ingredients are required:

  1. the forward rule: how to compute $y$ from $x_1, \dots, x_k$;
  2. the local derivative rules: how $y$ changes when one input changes, i.e. the partial derivatives $\partial y / \partial x_i$.

No node needs to understand the entire graph.

For the neuron above, the local rules are:

| Node | Forward rule | Local derivatives |
| --- | --- | --- |
| mult | $p_i = w_i x_i$ | $\partial p_i / \partial w_i = x_i$, $\;\partial p_i / \partial x_i = w_i$ |
| add | $z = p_1 + p_2 + p_3 + b$ | $\partial z / \partial p_i = 1$, $\;\partial z / \partial b = 1$ |
| sigmoid | $a = \sigma(z)$ | $\partial a / \partial z = \sigma(z)\,(1 - \sigma(z))$ |

Important

The global derivative is not stored anywhere in the graph. It emerges by composing local derivatives. This locality is the entire reason computational graphs scale.

Once a large computation has been decomposed into primitives with known local derivative rules, differentiation becomes modular.
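The locality principle can be made concrete: each primitive carries its own forward rule and its own local derivative rules, and nothing else. A minimal Python sketch (the class names `Mult` and `Sigmoid` are illustrative, not a framework API):

```python
import math

class Mult:
    """A local node: knows only its forward rule and local derivatives."""
    def forward(self, w, x):
        self.w, self.x = w, x
        return w * x
    def local_derivatives(self):
        # d(w*x)/dw = x  and  d(w*x)/dx = w
        return {"w": self.x, "x": self.w}

class Sigmoid:
    """Another local node; it never needs to know about Mult or the graph."""
    def forward(self, z):
        self.a = 1.0 / (1.0 + math.exp(-z))
        return self.a
    def local_derivatives(self):
        # d sigma(z)/dz = sigma(z) * (1 - sigma(z))
        return {"z": self.a * (1.0 - self.a)}
```

Composing such nodes reproduces the neuron, and differentiating the composite requires nothing beyond these per-node rules.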


The chain rule on a graph

The graph makes it possible to see exactly how a quantity influences another.

Suppose the goal is to compute the effect of $w_1$ on the final activation $a$.
The graph reveals the dependency chain immediately:

$$w_1 \;\to\; p_1 \;\to\; z \;\to\; a.$$

The chain rule then gives

$$\frac{\partial a}{\partial w_1} \;=\; \frac{\partial a}{\partial z}\,\frac{\partial z}{\partial p_1}\,\frac{\partial p_1}{\partial w_1}.$$

Substituting the local derivatives,

$$\frac{\partial a}{\partial w_1} \;=\; \sigma(z)\,\bigl(1 - \sigma(z)\bigr)\cdot 1 \cdot x_1.$$

If a scalar loss $L$ is attached on top of $a$, the same pattern simply extends one step further:

$$\frac{\partial L}{\partial w_1} \;=\; \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial w_1}.$$

The chain rule is therefore not an external add-on to the graph.
It is the mathematical law governing how sensitivity propagates through the graph.
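As a sanity check, the chain-rule product $\sigma(z)(1-\sigma(z))\,x_1$ for the neuron's sensitivity to its first weight can be compared against a numerical finite difference (the weights, inputs, and bias below are arbitrary illustrative values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def activation(w, x, b):
    """Forward pass of the three-input neuron: a = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, x, b = [0.5, -1.0, 2.0], [1.0, 2.0, 0.5], 0.1
z = sum(wi * xi for wi, xi in zip(w, x)) + b

# Chain rule along w1 -> p1 -> z -> a:
# da/dw1 = sigma'(z) * dz/dp1 * dp1/dw1 = sigma(z) * (1 - sigma(z)) * 1 * x1
analytic = sigmoid(z) * (1.0 - sigmoid(z)) * x[0]

# Finite-difference estimate of the same derivative.
eps = 1e-6
w_plus = [w[0] + eps, w[1], w[2]]
numeric = (activation(w_plus, x, b) - activation(w, x, b)) / eps

assert abs(analytic - numeric) < 1e-5
```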


Branching, merging, and gradient accumulation

The previous example follows a single chain of dependence. Real graphs are more general: a variable may influence the final output through several downstream routes.

This is the point at which many informal explanations become too vague.
The essential fact is the following:

Accumulation rule

When a variable affects the final output through multiple children, its total derivative is the sum of the contributions coming from all those children.

To see why, consider

$$y = f(u, v), \qquad u = g(x), \qquad v = h(x).$$

Then $x$ influences $y$ through two branches, one through $u$ and one through $v$.
The chain rule gives

$$\frac{dy}{dx} \;=\; \frac{\partial f}{\partial u}\,\frac{du}{dx} \;+\; \frac{\partial f}{\partial v}\,\frac{dv}{dx}.$$

This sum over branches is the origin of gradient accumulation.

In graph language, if a node value $u$ is used by several child nodes $c_1, \dots, c_m$, then

$$\frac{\partial L}{\partial u} \;=\; \sum_{j=1}^{m} \frac{\partial L}{\partial c_j}\,\frac{\partial c_j}{\partial u}.$$

This formula is one of the central equations of backpropagation.
It states that the total sensitivity of the final scalar to an intermediate quantity $u$ is obtained by summing the sensitivities transmitted back from every immediate downstream use of $u$.

Example

Consider the simplest case of two branches,

$$u = x, \qquad v = x, \qquad y = u \cdot v \;=\; x^2.$$

Then

$$\frac{dy}{dx} \;=\; \frac{\partial y}{\partial u}\,\frac{du}{dx} \;+\; \frac{\partial y}{\partial v}\,\frac{dv}{dx} \;=\; v + u \;=\; 2x.$$

The derivative is a sum because the influence of $x$ reaches $y$ through two distinct downstream branches.

This is why backpropagation is not merely “multiply derivatives along a path.”
That statement is incomplete. The correct picture is:

  • multiply local derivatives along each path;
  • sum the contributions across all paths.
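The two-branch example above can be traced numerically; the per-path contributions are computed separately and then accumulated:

```python
# Gradient accumulation: x feeds two children (u and v), so its total
# derivative is the sum of the per-path contributions.
x = 3.0
u, v = x, x          # two downstream uses of the same value x
y = u * v            # y = x**2

# Path through u: dy/du * du/dx = v * 1
# Path through v: dy/dv * dv/dx = u * 1
grad_x = v * 1.0 + u * 1.0   # accumulate across the two paths

assert grad_x == 2 * x       # matches d(x^2)/dx = 2x
```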

Adjoint variables and the backward pass

For a scalar output of interest $y$, define for every intermediate quantity $v$ the adjoint

$$\bar{v} \;=\; \frac{\partial y}{\partial v}.$$

This quantity is also called the upstream gradient at node $v$.

Info

The expression upstream is used relative to the backward pass: it is the gradient that arrives at node $v$ from the part of the graph closer to the final scalar objective $y$.

Equivalently, $\bar{v}$ measures how sensitive $y$ is to an infinitesimal change in $v$, after accounting for everything that happens downstream of $v$.

The benefit of adjoints is conceptual and algorithmic:

  • conceptually, $\bar{v}$ summarizes the total downstream influence of $v$ on the final scalar $y$;
  • algorithmically, once $\bar{v}$ is known, there is no need to revisit all downstream paths separately.

Suppose a node computes

$$v = f(u_1, \dots, u_k),$$

where $u_1, \dots, u_k$ are its parents.
Then the local backward contribution from $v$ to one parent $u_i$ is

$$\bar{v}\,\frac{\partial v}{\partial u_i}.$$

Local backpropagation rule

This contribution formula is the local backpropagation rule. It is simply the chain rule written in a reusable local form.

Applied once for every child of a node and then accumulated, it reproduces the accumulation formula of the previous section.

For a scalar output, the backward pass starts from

$$\bar{y} \;=\; \frac{\partial y}{\partial y} \;=\; 1$$

and propagates these adjoints through the graph in reverse topological order. That ordering guarantees that when a node is processed, all adjoint information coming from its children has already been computed and can be accumulated correctly.

Core operational picture

The forward pass computes values. The backward pass computes adjoints. Backpropagation is the systematic propagation of adjoints through the graph using local derivative rules.

For learning problems, the scalar $y$ is typically the loss $L$.
Thus, the backward pass produces quantities such as

$$\frac{\partial L}{\partial w_i} \qquad \text{and} \qquad \frac{\partial L}{\partial b},$$

which are exactly the gradients required by gradient-based optimization.
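All of the pieces introduced so far combine into a tiny reverse-mode engine. The sketch below (a deliberately minimal scalar autodiff class; the name `Value` and the API are illustrative, not a framework) records the graph during the forward pass, then propagates adjoints in reverse topological order, accumulating with `+=` exactly as the accumulation rule demands:

```python
import math

class Value:
    """Minimal scalar autodiff node: stores a value, an adjoint (grad),
    and a closure that pushes its adjoint back to its parents."""
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self.parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad               # d(u+v)/du = 1
            other.grad += out.grad              # d(u+v)/dv = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # accumulate, never overwrite
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def sigmoid(self):
        a = 1.0 / (1.0 + math.exp(-self.data))
        out = Value(a, (self,))
        def _backward():
            self.grad += a * (1.0 - a) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Collect nodes in topological order, then sweep in reverse so that
        # each node runs only after all of its children have contributed.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v.parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0                 # the adjoint of the output is 1
        for v in reversed(topo):
            v._backward()

# A two-input neuron: a = sigmoid(w1*x1 + w2*x2 + b)
w1, x1 = Value(0.5), Value(1.0)
w2, x2 = Value(-1.0), Value(2.0)
b = Value(0.1)
a = (w1 * x1 + w2 * x2 + b).sigmoid()
a.backward()
# w1.grad now equals sigma(z) * (1 - sigma(z)) * x1
```

After `a.backward()`, every leaf holds its adjoint, recovered in a single reverse sweep.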


Why backpropagation is efficient

Scope of this section

The present section explains why backpropagation is structurally efficient when viewed as graph-based reverse-mode differentiation. The note On the efficiency of BP addresses a different question: how this efficiency compares, at the level of computational cost, with naive gradient estimation methods such as finite differences.

At a purely conceptual level, one may imagine computing a derivative by enumerating every path from a parameter to the loss, multiplying local derivatives along each path, and then summing over all paths.
That intuition is useful, but as an algorithm it is disastrous: in large graphs, the number of paths can grow explosively.

Backpropagation avoids this combinatorial explosion through reuse.

The adjoint $\bar{v}$ is a cached summary of all downstream paths from $v$ to the output $y$.
Once this quantity has been computed, all downstream complexity has been compressed into a single number, vector, or tensor.
There is no need to recompute that downstream influence separately for every ancestor.

This is a dynamic-programming viewpoint:

  • the forward pass computes and stores intermediate values;
  • the backward pass computes and stores intermediate adjoints;
  • each node is processed once in the forward direction and once in the reverse direction.
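The contrast between path enumeration and adjoint reuse can be quantified on a toy "diamond chain" (an illustrative construction, not from the original text): each block computes $s = 2u + 3u$ through two parallel branches, so the number of input-to-output paths doubles per block, yet one reverse sweep handles each block once:

```python
def grad_by_path_enumeration(n):
    """Sum of per-path derivative products over all 2**n paths
    (exponential work: one term per path through the n diamonds)."""
    total = 0.0
    for mask in range(2 ** n):
        prod = 1.0
        for j in range(n):
            # In diamond j, one branch has local derivative 2, the other 3.
            prod *= 2.0 if (mask >> j) & 1 else 3.0
        total += prod
    return total

def grad_by_reverse_sweep(n):
    """One adjoint per diamond, reused by both branches (linear work)."""
    adj = 1.0
    for _ in range(n):
        adj = adj * 2.0 + adj * 3.0   # accumulate across the two children
    return adj

# Both agree with the closed form (2 + 3)**n = 5**n, at very different cost.
assert grad_by_path_enumeration(10) == grad_by_reverse_sweep(10) == 5.0 ** 10
```

The path-enumeration loop touches $2^{10} = 1024$ paths for ten diamonds; the reverse sweep performs ten accumulation steps.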

Success

Backpropagation is efficient not because the chain rule is simple, but because the chain rule is organized so that shared subcomputations are reused rather than recomputed.

This explains the structural source of the speedup.
The dedicated note On the efficiency of BP then turns that structural insight into an explicit cost comparison against naive numerical differentiation.


Backpropagation and automatic differentiation

Computational graphs belong naturally within the broader subject of automatic differentiation.

It is useful to distinguish four notions that are often conflated:

| Method | Main idea | Nature of result | Main limitation |
| --- | --- | --- | --- |
| Manual differentiation | Derive formulas by hand | Exact symbolic derivative | Impossible to scale to large programs |
| Symbolic differentiation | Manipulate algebraic expressions symbolically | Exact symbolic derivative | Expression swell and poor fit to real code |
| Numerical differentiation | Use finite differences | Approximate derivative | Slow and numerically fragile |
| Automatic differentiation | Differentiate the executed computation by composing local derivatives | Exact up to floating-point arithmetic | Requires differentiable local primitives |

Automatic differentiation does not approximate derivatives by finite differences, and it does not require a giant closed-form symbolic derivative.
Instead, it differentiates the actual computation that was executed, operation by operation.

This is exactly the role played by the computational graph.

Important

Backpropagation is reverse-mode automatic differentiation specialized to the common deep-learning regime in which a scalar loss depends on many parameters.


Why reverse mode is the right mode for learning

Automatic differentiation has more than one mode.

For a function $f : \mathbb{R}^n \to \mathbb{R}^m$,

the two most important modes are:

  • forward mode, which propagates sensitivities from inputs to outputs;
  • reverse mode, which propagates sensitivities from outputs back to inputs.

The asymmetry is crucial.

  • Forward mode is natural when the number of inputs of interest is small.
  • Reverse mode is natural when the number of outputs of interest is small.

Neural network training is almost perfectly aligned with reverse mode because the relevant function has the form

$$\theta \;\mapsto\; L(\theta) \in \mathbb{R},$$

where:

  • $\theta$ contains potentially millions or billions of parameters;
  • the output is a single scalar loss.

To recover the full gradient with respect to all parameters:

  • forward mode would require propagating one seeded sensitivity direction per parameter in order to reconstruct the entire gradient;
  • reverse mode starts from the single scalar output and recovers all parameter derivatives in one backward sweep.

Important

This is why backpropagation is the natural differentiation strategy for deep learning: the output dimension is usually $1$, while the parameter dimension is enormous.

This input-output asymmetry is the structural reason reverse mode dominates gradient-based learning.
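The cost of forward mode can be illustrated with dual numbers, the simplest implementation of forward-mode differentiation: recovering the full gradient of a scalar function of $n$ parameters takes $n$ seeded forward passes, one per parameter. A minimal sketch (the `Dual` class and the toy sum-of-squares loss are illustrative):

```python
class Dual:
    """Forward-mode dual number: carries a value and one directional
    derivative (the sensitivity along the currently seeded direction)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def loss(params):
    """Toy scalar loss: the sum of squares of the parameters."""
    total = Dual(0.0)
    for p in params:
        total = total + p * p
    return total

theta = [1.0, 2.0, 3.0]
grad = []
for i in range(len(theta)):            # one forward pass PER parameter
    params = [Dual(t, 1.0 if j == i else 0.0) for j, t in enumerate(theta)]
    grad.append(loss(params).dot)

assert grad == [2.0, 4.0, 6.0]         # d(sum t_j^2)/dt_i = 2 t_i
```

Reverse mode obtains the same three numbers in one backward sweep starting from the single scalar output; with millions of parameters, that difference is decisive.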


Scalar picture versus tensor reality

The preceding discussion used scalar quantities because scalars make the logic maximally transparent.
Real deep-learning systems, however, operate mostly on vectors, matrices, and higher-order tensors.

Conceptually, nothing essential changes, but the notation must be handled with more care.

Suppose a local operation takes an input tensor $x$ and produces an output tensor $y$:

$$y = f(x).$$

To connect this with standard multivariable calculus, one may temporarily imagine flattening $x$ into a vector in $\mathbb{R}^n$ and $y$ into a vector in $\mathbb{R}^m$.
At that point, the derivative of $f$ at the current point is represented by its Jacobian matrix

$$J \in \mathbb{R}^{m \times n},$$

whose entries are

$$J_{ij} \;=\; \frac{\partial y_i}{\partial x_j}.$$

In other words, the Jacobian collects all first-order partial derivatives of the output components with respect to the input components.

The Jacobian is important because it is the linear object that locally approximates the operation:

$$f(x + \delta) \;\approx\; f(x) + J\,\delta \qquad \text{for small perturbations } \delta.$$

Note

This is the tensor-valued analogue of the ordinary derivative of a scalar function: it describes how a small perturbation in the input propagates to a first-order perturbation in the output.

However, reverse-mode differentiation does not usually need the full Jacobian explicitly.

What is given during the backward pass is an upstream gradient

$$\frac{\partial L}{\partial y},$$

where $L$ is the final scalar objective, typically the loss.
This is the gradient arriving at the present operation from the side of the graph nearer to $L$.
What is needed is the corresponding gradient with respect to the input tensor $x$.

Using the row-gradient convention adopted in this note, the chain rule gives

$$\frac{\partial L}{\partial x} \;=\; \frac{\partial L}{\partial y}\,J.$$

This operation is a vector-Jacobian product: an upstream gradient is contracted with the Jacobian of the local operation to produce the downstream input gradient.

About transpose conventions

Many texts use column-vector gradients instead. In that convention the same rule is written as

$$\nabla_x L \;=\; J^{\top}\,\nabla_y L.$$

The mathematics is the same; only the orientation convention changes.

Why autodiff systems remain practical

Modern frameworks do not explicitly materialize gigantic Jacobian tensors for every layer. Each primitive operation implements a local backward rule that maps an upstream gradient to the corresponding input gradients. This preserves the locality principle even in high-dimensional tensor computations.

For example:

  • matrix multiplication implements a backward rule using transposed matrices rather than a gigantic explicit Jacobian;
  • convolution layers use structured local backward operators rather than flattening everything into an enormous dense matrix;
  • elementwise activations act through simple diagonal Jacobian structure, which can be applied without ever building the diagonal matrix itself.

Thus, the scalar graph should be interpreted as the clean conceptual prototype of the tensor case, not as a toy disconnected from real systems.
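The matrix-multiplication case can be written out explicitly. For $y = x\,w$, the backward rule contracts the upstream gradient with transposed factors, $\partial L/\partial x = g\,w^{\top}$ and $\partial L/\partial w = x^{\top} g$, without ever materializing the Jacobian. A pure-Python sketch (helper names are illustrative; one entry is checked by finite differences):

```python
def matmul(A, B):
    """Naive dense matrix product of nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul_vjp(x, w, g):
    """Backward rule for y = x @ w given upstream gradient g = dL/dy:
    dL/dx = g @ w^T  and  dL/dw = x^T @ g.  No explicit Jacobian."""
    return matmul(g, transpose(w)), matmul(transpose(x), g)

x = [[1.0, 2.0]]
w = [[3.0, 4.0], [5.0, 6.0]]
g = [[1.0, 1.0]]          # upstream gradient for the toy loss L = sum(y)

gx, gw = matmul_vjp(x, w, g)

# Finite-difference check of dL/dx[0][0], where L = sum of entries of x @ w.
def L_of(x_):
    return sum(sum(row) for row in matmul(x_, w))

eps = 1e-6
x_plus = [[1.0 + eps, 2.0]]
assert abs(gx[0][0] - (L_of(x_plus) - L_of(x)) / eps) < 1e-4
```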


From mathematical graph to framework graph

Up to this point, the discussion has focused on the mathematical computational graph: the dependency structure required by the chain rule. Modern deep-learning frameworks implement that same idea, but not always in the same architectural way.

Two distinctions are especially important.

Dynamic versus staged graphs

In some systems, the graph is built as the program executes. This is the define-by-run style associated with frameworks such as PyTorch. The executed operations themselves reveal the graph, and the backward pass is attached to the concrete computation that actually happened.

In other systems, the computation is first represented in a more explicit intermediate form and then transformed, optimized, or compiled before execution. This staged or compiled style appears in systems such as JAX/XLA and in graph-compilation pipelines more generally.

Info

The mathematical object is the same in both cases: a differentiable computation with a dependency structure. What changes is when and how that structure is materialized, optimized, and executed.

So the phrase computational graph can refer to more than one layer of reality:

  • the conceptual graph needed by the chain rule;
  • the internal graph representation used by a framework;
  • the optimized or compiled graph that is actually scheduled on hardware.

These layers are related, but they are not always identical.

Graph structure versus execution strategy

The clean graph drawn on paper is not necessarily the graph executed literally by the framework.

For example:

  • several primitive operations may be fused into a single kernel to reduce memory traffic and launch overhead;
  • the backward pass may recompute some forward activations instead of storing all of them, in order to save memory;
  • the graph may be rewritten internally into a more hardware-efficient form before execution.

Memory is often the real bottleneck

The conceptual description of backpropagation says: store the intermediate quantities needed for the backward pass. In modern large-scale models, however, storing all activations can be prohibitively expensive. This is why practical systems often use recomputation strategies such as gradient checkpointing: some intermediate values are deliberately discarded during the forward pass and recomputed later during the backward pass, trading extra compute for lower memory usage.

This does not alter the mathematical content of backpropagation. It changes the execution strategy used to realize that mathematics under real hardware constraints.
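The checkpointing trade-off can be sketched in a few lines. The toy below (illustrative only, not how production frameworks implement it; it assumes the layer count is a multiple of the checkpoint interval `k`) stores every k-th activation of a chain of tanh layers during the forward pass, then rebuilds each segment from its checkpoint during the backward sweep:

```python
import math

def layer(h, w):
    """One scalar layer: h_next = tanh(w * h)."""
    return math.tanh(w * h)

def forward_with_checkpoints(x, ws, k):
    """Forward pass that stores only every k-th activation (the checkpoints),
    discarding the rest to save memory."""
    ckpts = {0: x}
    h = x
    for i, w in enumerate(ws):
        h = layer(h, w)
        if (i + 1) % k == 0:
            ckpts[i + 1] = h
    return h, ckpts

def backward_with_recompute(ws, ckpts, k):
    """Reverse sweep over segments: rebuild each segment's activations from
    its checkpoint, then backpropagate through it (extra compute, less memory).
    Gradients are for the toy objective L = h_n (the final activation)."""
    n = len(ws)
    g = 1.0                              # dL/dh_n = 1
    grads = [0.0] * n
    for seg_end in range(n, 0, -k):
        seg_start = seg_end - k
        # Recompute the discarded activations inside this segment.
        hs = [ckpts[seg_start]]
        for i in range(seg_start, seg_end):
            hs.append(layer(hs[-1], ws[i]))
        # Backpropagate through the segment using local tanh derivatives.
        for i in reversed(range(seg_start, seg_end)):
            pre = ws[i] * hs[i - seg_start]
            local = 1.0 - math.tanh(pre) ** 2    # d tanh(z)/dz
            grads[i] = g * local * hs[i - seg_start]   # dL/dw_i
            g = g * local * ws[i]                      # dL/dh_i
    return grads

ws = [0.5, -0.3, 0.8, 0.2]                       # 4 layers, checkpoint every 2
out, ckpts = forward_with_checkpoints(1.0, ws, 2)  # stores only h0, h2, h4
grads = backward_with_recompute(ws, ckpts, 2)
```

The gradients are mathematically identical to those of a backward pass with full activation storage; only the memory-compute balance changes.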


What can and cannot be backpropagated through

Backpropagation applies whenever the computation can be decomposed into local operations that provide valid derivative rules.

This requirement is sometimes summarized carelessly by saying that “the whole graph must be differentiable.”
A more precise statement is preferable.

Differentiability requirement

Backpropagation requires each local operation to admit a usable local derivative rule, at least almost everywhere or in the generalized sense employed by the framework.

This is why piecewise-differentiable operations such as ReLU or max can still be trained successfully: although they are not differentiable everywhere in the classical sense, they still admit workable local backward rules.
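ReLU makes the point concrete: classical differentiability fails at exactly $z = 0$, yet a perfectly workable local backward rule exists. The convention below (derivative $0$ at $z = 0$) is one common choice, not a mathematical necessity:

```python
def relu(z):
    """Forward rule: max(0, z)."""
    return z if z > 0 else 0.0

def relu_backward(z, upstream):
    """Local backward rule: pass the upstream gradient through where z > 0,
    block it elsewhere.  At z == 0 we pick 0 by convention; any value in
    [0, 1] would serve as a generalized (sub)derivative."""
    return upstream if z > 0 else 0.0

assert relu_backward(2.0, 5.0) == 5.0    # gradient flows where z > 0
assert relu_backward(-1.0, 5.0) == 0.0   # gradient blocked where z < 0
```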

By contrast, operations that sever the computational dependency in a genuinely non-differentiable way, such as hard discrete choices or sampling steps without reparameterization, do not support ordinary gradient flow through the naive graph.

In practice, modern machine learning sometimes works around this limitation through surrogate constructions.

Examples include:

  • reparameterization tricks, which rewrite stochastic nodes in differentiable form;
  • straight-through estimators, which deliberately use a backward rule that does not coincide with the true derivative of the forward operation.

Warning

Such techniques should not be confused with ordinary backpropagation through a genuinely differentiable graph. They are pragmatic extensions or approximations introduced precisely because the naive graph would otherwise block gradient flow.

So the correct criterion is not whether something is “a neural network,” but whether the graph preserves a differentiable pathway from the scalar objective back to the parameters.


Final synthesis

A computational graph should be understood as the natural language in which backpropagation becomes transparent.

It provides, all at once:

  • a decomposition of a complicated computation into elementary local operations;
  • a principled way to attach local derivative rules to those operations;
  • a rigorous explanation of why gradients multiply along paths and add across branches;
  • an algorithmic framework in which intermediate sensitivities are reused efficiently;
  • and the exact bridge from neural network training to reverse-mode automatic differentiation.

The logic of the full procedure can be summarized compactly:

  1. express the forward computation as a graph of simple differentiable primitives;
  2. evaluate the graph forward and cache the intermediate values needed by local derivative rules;
  3. initialize the adjoint of the scalar objective to $1$;
  4. traverse the graph backward, accumulating adjoints using the local chain-rule update;
  5. read off the resulting derivatives with respect to every parameter.

Important

Backpropagation is not best thought of as a mysterious neural-network-specific invention. It is reverse-mode automatic differentiation performed on a computational graph. Once this viewpoint is internalized, the apparent complexity of gradient-based learning becomes a disciplined consequence of composition, locality, and the chain rule.