Consider an output layer with $K$ classes. Let

$$z = (z_1, \dots, z_K) \in \mathbb{R}^K$$

denote the vector of logits, and let

$$a = \operatorname{softmax}(z)$$

denote the output activations, whose components are

$$a_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}, \qquad i = 1, \dots, K.$$

As explained in Softmax Loss, the vector $a$ is a probability distribution: each component is strictly positive and

$$\sum_{i=1}^{K} a_i = 1.$$
Monotonicity and Competition Between Classes
One of the most important qualitative properties of softmax is the following:
- increasing the logit of one class increases its own output probability;
- the same perturbation decreases the probabilities assigned to all the other classes.
This statement can be made precise by studying the partial derivatives $\partial a_i / \partial z_j$.
Explicit Derivative Formulas
The derivative of the $i$-th softmax output with respect to the $j$-th logit is

$$\frac{\partial a_i}{\partial z_j} = a_i\,(\delta_{ij} - a_j),$$

where $\delta_{ij}$ is the Kronecker delta. Therefore:

- for $i = j$,
$$\frac{\partial a_i}{\partial z_i} = a_i(1 - a_i) > 0;$$
- for $i \neq j$,
$$\frac{\partial a_i}{\partial z_j} = -a_i a_j < 0.$$
These two formulas establish the monotonicity structure of the softmax layer:
- increasing $z_i$ increases $a_i$;
- increasing $z_j$ decreases every $a_i$ with $i \neq j$.
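This sign structure can be checked numerically with finite differences. A minimal plain-Python sketch (the `softmax` helper and the sample logits are illustrative, not from the text):

```python
import math

def softmax(z):
    # Exponentiate and normalize; shifting by the max keeps exp well-behaved.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Perturb one logit and watch every component move.
z = [1.0, 0.5, -0.2]
a = softmax(z)
eps = 1e-6
z_pert = [z[0] + eps, z[1], z[2]]   # bump the logit of class 0
a_pert = softmax(z_pert)

assert a_pert[0] > a[0]             # own probability increases
assert a_pert[1] < a[1]             # every other probability decreases
assert a_pert[2] < a[2]
```

Only one logit was touched, yet all three outputs changed, in exactly the directions the derivative formulas predict.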
Intuition
The numerator $e^{z_i}$ tries to increase the weight assigned to class $i$, but the denominator $\sum_k e^{z_k}$ is shared by all classes. Since the total probability mass must remain equal to $1$, any gain by one class must be compensated by losses elsewhere.
Important
The softmax layer therefore induces a genuine competition between classes. This is precisely the behavior needed in mutually exclusive multiclass classification: the classes do not make independent yes/no decisions, but compete for a single unit of probability mass.
Non-Locality of the Softmax Layer
For a sigmoid output unit, the activation is local:

$$a_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}.$$

Each output depends only on its own pre-activation.

Softmax behaves differently. In an output layer,

$$a_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}},$$

so each output depends not only on $z_i$, but on all logits $z_1, \dots, z_K$ through the common denominator.
This is why the softmax layer is non-local:
- changing $z_j$ affects $a_j$ directly;
- the same change also affects all other outputs $a_i$, $i \neq j$, indirectly.
The non-zero off-diagonal derivatives

$$\frac{\partial a_i}{\partial z_j} = -a_i a_j \neq 0, \qquad i \neq j,$$

are the differential signature of this non-local coupling.
Local Versus Non-Local Output Layers
- A sigmoid output is local: each activation depends only on its own logit.
- A softmax output is non-local: every activation depends on all logits in the layer.
- This non-locality is exactly what allows the output to be interpreted as a categorical probability distribution.
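The local/non-local contrast can be made concrete by comparing the two activations on logit vectors that differ in a single coordinate. A small plain-Python sketch (helper names and logits are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z1 = [0.5, 1.5, -1.0]
z2 = [0.5, 1.5, 2.0]    # only the last logit differs

# Sigmoid is local: units whose logit is untouched keep their activation.
assert sigmoid(z1[0]) == sigmoid(z2[0])
assert sigmoid(z1[1]) == sigmoid(z2[1])

# Softmax is non-local: every output shifts through the shared denominator.
a1, a2 = softmax(z1), softmax(z2)
assert all(p != q for p, q in zip(a1, a2))
```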
Why the name “Soft-Max”
The origin of the name becomes transparent by introducing a sharpening parameter $c > 0$:

$$a_i(c) = \frac{e^{c z_i}}{\sum_{k=1}^{K} e^{c z_k}}.$$

The standard softmax corresponds to $c = 1$.
Why “Soft-Max”?
- For every $c > 0$, the vector $a(c)$ is still a probability distribution, because
$$\sum_{i=1}^{K} a_i(c) = 1.$$
- Let $M = \{\, i : z_i = \max_k z_k \,\}$ be the set of maximizers. Then
$$\lim_{c \to \infty} a_i(c) = \begin{cases} 1/|M| & \text{if } i \in M, \\ 0 & \text{otherwise.} \end{cases}$$
- In particular, if the maximum is unique, say at class $i^{*}$, then
$$\lim_{c \to \infty} a(c) = e_{i^{*}}.$$
In that case the limit is the one-hot output associated with a hardmax selector.
This explains the name: softmax is a differentiable, softened version of a hardmax or argmax-like selection mechanism. At finite $c$, probability mass is distributed across all classes; as $c$ grows, the distribution becomes increasingly concentrated on the largest logit.
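The limiting behavior can be observed directly. A minimal plain-Python sketch (the helper `softmax_sharpened` and the sample logits are illustrative):

```python
import math

def softmax_sharpened(z, c):
    # Softmax applied to c*z; large c pushes the mass onto the argmax.
    m = max(z)
    exps = [math.exp(c * (v - m)) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, 0.5]

# For every c > 0 the output is still a probability distribution.
for c in [0.5, 1.0, 5.0, 50.0]:
    a = softmax_sharpened(z, c)
    assert abs(sum(a) - 1.0) < 1e-12

# At c = 50 the distribution is essentially one-hot on the largest logit.
assert softmax_sharpened(z, 50.0)[0] > 0.999
```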
Equivalent Temperature Parameter
In modern machine learning practice, especially in the context of probabilistic decoding and large language models, the same family of transformations is usually parameterized by a temperature $T > 0$ rather than by a sharpening parameter $c$.
The two parameterizations are equivalent:

$$T = \frac{1}{c}.$$

Therefore the softmax may be written as

$$a_i(T) = \frac{e^{z_i / T}}{\sum_{k=1}^{K} e^{z_k / T}}.$$
This notation is often more intuitive in applications:
- as $T \to 0^{+}$, the distribution becomes increasingly concentrated on the largest logit and approaches a hardmax distribution when the maximizer is unique;
- as $T \to \infty$, all logits are flattened relative to one another and the distribution approaches the uniform distribution over the $K$ classes.
Thus, low temperature makes the output more peaked and more deterministic, whereas high temperature makes it flatter and more random. In modern large language models, temperature is exactly the parameter used at sampling time to control how concentrated or exploratory the next-token distribution should be.
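Both temperature extremes can be checked numerically. A minimal plain-Python sketch (the helper `softmax_T` and the sample logits are illustrative):

```python
import math

def softmax_T(z, T):
    # Temperature-scaled softmax: divide logits by T before normalizing.
    m = max(z)
    exps = [math.exp((v - m) / T) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, 0.5]

low  = softmax_T(z, 0.05)    # low temperature: near-deterministic
high = softmax_T(z, 100.0)   # high temperature: near-uniform

assert low[0] > 0.999                             # almost all mass on argmax
assert all(abs(p - 1/3) < 0.01 for p in high)     # close to uniform
```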
The same idea can be illustrated numerically. Consider the logits

$$z = (3.0,\; 1.0,\; 0.2).$$

Then:

- a hardmax produces the one-hot vector $(1,\; 0,\; 0)$;
- a naive normalization of the positive scores, $z_i / \sum_k z_k$, would produce approximately $(0.714,\; 0.238,\; 0.048)$;
- the softmax output is approximately $(0.836,\; 0.113,\; 0.051)$.
Softmax is therefore much closer to a hard winner-take-most rule than plain normalization, while remaining fully differentiable.
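A plain-Python sketch of this comparison (the sample logits are illustrative):

```python
import math

z = [3.0, 1.0, 0.2]

# Hardmax: all probability mass on the argmax.
hard = [1.0 if v == max(z) else 0.0 for v in z]

# Naive normalization of the positive raw scores.
naive = [v / sum(z) for v in z]

# Softmax of the same scores.
exps = [math.exp(v) for v in z]
soft = [e / sum(exps) for e in exps]

assert hard == [1.0, 0.0, 0.0]
assert soft[0] > naive[0]            # softmax is closer to winner-take-most
assert abs(soft[0] - 0.836) < 0.001
```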
Shift Invariance and Numerical Stability
Softmax is invariant under the addition of the same constant to all logits:

$$\operatorname{softmax}(z + c\,\mathbf{1}) = \operatorname{softmax}(z) \qquad \text{for all } c \in \mathbb{R},$$

where $\mathbf{1} = (1, \dots, 1)$ denotes the all-ones vector. Indeed,

$$\frac{e^{z_i + c}}{\sum_k e^{z_k + c}} = \frac{e^{c}\, e^{z_i}}{e^{c} \sum_k e^{z_k}} = \frac{e^{z_i}}{\sum_k e^{z_k}}.$$

This property is mathematically important and computationally useful. In practice, one usually subtracts $\max_k z_k$ before exponentiating:

$$a_i = \frac{e^{z_i - \max_k z_k}}{\sum_j e^{z_j - \max_k z_k}}.$$
The result is unchanged, but numerical overflow is avoided. This is one of the reasons why modern implementations compute softmax and cross-entropy from logits in carefully stabilized form.
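The difference between the naive and the stabilized formula is easy to demonstrate: for large logits, CPython's `math.exp` overflows a double and raises `OverflowError`, while the max-shifted version is safe. A minimal sketch (the helper name and the sample logits are illustrative):

```python
import math

def softmax_stable(z):
    # Subtract the max logit before exponentiating; by shift invariance the
    # result is unchanged, but exp never sees a large positive argument.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [1000.0, 999.0, 998.0]

# The naive formula overflows: math.exp(1000.0) exceeds double precision.
try:
    naive = [math.exp(v) for v in z]
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed

# The stabilized version quietly computes softmax of [0, -1, -2].
a = softmax_stable(z)
assert abs(sum(a) - 1.0) < 1e-12
assert a[0] > a[1] > a[2]
```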
Derivative of Softmax and Jacobian Structure
Since softmax maps $\mathbb{R}^K$ to $\mathbb{R}^K$, its derivative is a Jacobian matrix:

$$J \in \mathbb{R}^{K \times K}, \qquad J_{ij} = \frac{\partial a_i}{\partial z_j}.$$

To compute a generic entry, write

$$a_i = \frac{e^{z_i}}{S}, \qquad S = \sum_{k=1}^{K} e^{z_k}.$$

Then

$$\frac{\partial}{\partial z_j}\, e^{z_i} = \delta_{ij}\, e^{z_i},$$

while

$$\frac{\partial S}{\partial z_j} = e^{z_j}.$$

Applying the quotient rule gives

$$\frac{\partial a_i}{\partial z_j} = \frac{\delta_{ij}\, e^{z_i}\, S - e^{z_i}\, e^{z_j}}{S^2} = \delta_{ij}\, a_i - a_i a_j.$$
Diagonal Entries
If $i = j$, then

$$\frac{\partial a_i}{\partial z_i} = a_i - a_i^2 = a_i(1 - a_i).$$
Off-Diagonal Entries
If $i \neq j$, then

$$\frac{\partial a_i}{\partial z_j} = -a_i a_j.$$
Combining the two cases yields the compact formula

$$\frac{\partial a_i}{\partial z_j} = a_i\,(\delta_{ij} - a_j).$$

Equivalently, the same derivative may be written in piecewise form as

$$\frac{\partial a_i}{\partial z_j} = \begin{cases} a_i(1 - a_i) & \text{if } i = j, \\ -a_i a_j & \text{if } i \neq j, \end{cases}$$

where the Kronecker delta is defined by

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$
This is exactly the derivative used in Softmax Loss to prove that, under softmax plus negative log-likelihood,

$$\frac{\partial \mathcal{L}}{\partial z_j} = a_j - y_j,$$

where $y$ denotes the one-hot target vector.
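This remarkably simple gradient can be verified against finite differences. A minimal plain-Python sketch (the helpers `softmax` and `nll` and the sample logits are illustrative):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def nll(z, target):
    # Negative log-likelihood of the target class under softmax(z).
    return -math.log(softmax(z)[target])

z, target = [0.3, -1.2, 0.8], 0
a = softmax(z)
y = [1.0 if i == target else 0.0 for i in range(len(z))]

# Central finite differences should match a - y componentwise.
eps = 1e-6
for j in range(len(z)):
    zp = list(z); zp[j] += eps
    zm = list(z); zm[j] -= eps
    grad_fd = (nll(zp, target) - nll(zm, target)) / (2 * eps)
    assert abs(grad_fd - (a[j] - y[j])) < 1e-6
```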
Matrix Form of the Jacobian
If $a$ is viewed as a column vector, the Jacobian can be written compactly as

$$J = \operatorname{diag}(a) - a\,a^{\top}.$$
Several important consequences follow immediately:
- the diagonal entries $a_i(1 - a_i)$ are positive;
- the off-diagonal entries $-a_i a_j$ are negative;
- each row sum is zero: $\sum_j J_{ij} = a_i - a_i \sum_j a_j = 0$;
- each column sum is zero: $\sum_i J_{ij} = a_j - a_j = 0$;
- the Jacobian is singular, since $J\mathbf{1} = 0$.
These facts encode the two defining structural constraints of softmax:
- the outputs must always sum to $1$;
- adding the same constant to all logits leaves the output unchanged.
Equivalently, softmax depends only on relative logits, not on their common offset.
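All of these structural facts can be verified by building the Jacobian entrywise. A minimal plain-Python sketch (the helper `softmax` and the sample logits are illustrative):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

a = softmax([0.5, -0.3, 1.2])
K = len(a)

# J = diag(a) - a a^T, built entry by entry.
J = [[(a[i] if i == j else 0.0) - a[i] * a[j] for j in range(K)]
     for i in range(K)]

for i in range(K):
    assert J[i][i] > 0                                    # positive diagonal
    assert all(J[i][j] < 0 for j in range(K) if j != i)   # negative off-diagonal
    assert abs(sum(J[i])) < 1e-12                         # zero row sums
for j in range(K):
    assert abs(sum(J[i][j] for i in range(K))) < 1e-12    # zero column sums

# Singularity: J annihilates the all-ones direction, i.e. J @ 1 = 0.
ones_image = [sum(J[i][j] for j in range(K)) for i in range(K)]
assert all(abs(v) < 1e-12 for v in ones_image)
```

The zero row sums encode conservation of total probability; the zero image of the all-ones vector encodes shift invariance.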
Softmax in One Sentence
Softmax is a differentiable, non-local normalization map that transforms logits into a categorical probability distribution, induces competition between classes, and has Jacobian

$$J = \operatorname{diag}(a) - a\,a^{\top}, \qquad J_{ij} = a_i\,(\delta_{ij} - a_j).$$