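Consider an output layer with $K$ classes. Let
$$z = (z_1, \dots, z_K) \in \mathbb{R}^K$$
denote the vector of logits, and let
$$a = \operatorname{softmax}(z)$$
denote the output activations, whose components are
$$a_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}, \qquad i = 1, \dots, K.$$
As explained in Softmax Loss, the vector $a$ is a probability distribution: each component is strictly positive and
$$\sum_{i=1}^{K} a_i = 1.$$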
Monotonicity and Competition Between Classes
One of the most important qualitative properties of softmax is the following:
- increasing the logit of one class increases its own output probability;
- the same perturbation decreases the probabilities assigned to all the other classes.
This statement can be made precise by studying the partial derivatives $\partial a_i / \partial z_j$.
Explicit Derivative Formulas
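The derivative of the $i$-th softmax output with respect to the $j$-th logit is
$$\frac{\partial a_i}{\partial z_j} = a_i(\delta_{ij} - a_j),$$
where $\delta_{ij}$ is the Kronecker delta. Therefore:
- for $i = j$, $\dfrac{\partial a_j}{\partial z_j} = a_j(1 - a_j) > 0$;
- for $i \neq j$, $\dfrac{\partial a_i}{\partial z_j} = -a_i a_j < 0$.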
These two formulas establish the monotonicity structure of the softmax layer:
- increasing $z_j$ increases $a_j$;
- increasing $z_j$ decreases every $a_i$ with $i \neq j$.
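The two sign statements above can be checked numerically. The following is a minimal sketch in plain Python, assuming a hypothetical logit vector and a hypothetical perturbation size chosen only for illustration; the helper name `softmax` is likewise an assumption, not something defined in the text.

```python
import math


def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, chosen only for illustration.
z = [1.0, 2.0, 0.5]
before = softmax(z)

# Raise the logit of class 1 and leave the other logits untouched.
bumped = list(z)
bumped[1] += 0.1
after = softmax(bumped)

for i, (b, a) in enumerate(zip(before, after)):
    direction = "up" if a > b else "down"
    print(f"class {i}: {b:.4f} -> {a:.4f} ({direction})")
```

Only the perturbed class should be reported as going up; the other classes lose probability even though their own logits did not change.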
Intuition
The numerator $e^{z_j}$ tries to increase the weight assigned to class $j$, but the denominator $\sum_k e^{z_k}$ is shared by all classes. Since the total probability mass must remain equal to $1$, any gain by one class must be compensated by losses elsewhere.
Important
The softmax layer therefore induces a genuine competition between classes. This is precisely the behavior needed in mutually exclusive multiclass classification: the classes do not make independent yes/no decisions, but compete for a single unit of probability mass.
Non-Locality of the Softmax Layer
For a sigmoid output unit, the activation is local:
$$a_j = \sigma(z_j) = \frac{1}{1 + e^{-z_j}}.$$
Each output depends only on its own pre-activation.
Softmax behaves differently. In a softmax output layer,
$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},$$
so each output $a_j$ depends not only on $z_j$, but on all logits through the common denominator.
This is why the softmax layer is non-local:
- changing $z_j$ affects $a_j$ directly;
- the same change also affects all other outputs indirectly.
The non-zero off-diagonal derivatives
$$\frac{\partial a_i}{\partial z_j} = -a_i a_j, \qquad i \neq j,$$
are the differential signature of this non-local coupling.
Local Versus Non-Local Output Layers
- A sigmoid output is local: each activation depends only on its own logit.
- A softmax output is non-local: every activation depends on all logits in the layer.
- This non-locality is exactly what allows the output to be interpreted as a categorical probability distribution.
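The contrast between local and non-local layers can be made concrete by changing a single logit and recording which outputs move. This is a small sketch with hypothetical logits; the helper names `sigmoid` and `softmax` and the tolerance `1e-12` are illustrative assumptions.

```python
import math


def sigmoid(v):
    """Logistic sigmoid of a single pre-activation."""
    return 1.0 / (1.0 + math.exp(-v))


def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits; only the last entry differs between the two vectors.
z = [1.0, 2.0, 0.5]
bumped = [1.0, 2.0, 1.5]

sig_before = [sigmoid(v) for v in z]
sig_after = [sigmoid(v) for v in bumped]
soft_before = softmax(z)
soft_after = softmax(bumped)

# Flag the outputs that moved by more than a tiny tolerance.
sigmoid_changed = [abs(a - b) > 1e-12 for b, a in zip(sig_before, sig_after)]
softmax_changed = [abs(a - b) > 1e-12 for b, a in zip(soft_before, soft_after)]

print("sigmoid outputs changed:", sigmoid_changed)  # [False, False, True]
print("softmax outputs changed:", softmax_changed)  # [True, True, True]
```

With sigmoid units only the perturbed unit responds, whereas with softmax every output moves because the shared denominator changes.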
Why the name “Soft-Max”
The origin of the name becomes transparent by introducing a sharpening parameter $\beta > 0$ (the inverse temperature):
$$a_i^{(\beta)} = \frac{e^{\beta z_i}}{\sum_{k=1}^{K} e^{\beta z_k}}.$$
The standard softmax corresponds to $\beta = 1$.
Why “Soft-Max”?
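- For every $\beta > 0$, the vector $a^{(\beta)}$ is still a probability distribution, because $a_i^{(\beta)} > 0$ for every $i$ and $\sum_{i=1}^{K} a_i^{(\beta)} = 1$.
- Let $M = \{\, i : z_i = \max_k z_k \,\}$ be the set of maximizers. Then
  $$\lim_{\beta \to \infty} a_i^{(\beta)} = \begin{cases} 1/|M|, & i \in M, \\ 0, & i \notin M. \end{cases}$$
- In particular, if the maximum is unique, say at class $m$, then $\lim_{\beta \to \infty} a_i^{(\beta)} = \delta_{im}$. In that case the limit is the one-hot output associated with a hardmax selector.
This explains the name: softmax is a differentiable, softened version of a hardmax or argmax-like selection mechanism. At finite $\beta$, probability mass is distributed across all classes; as $\beta$ grows, the distribution becomes increasingly concentrated on the largest logit.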
Equivalent Temperature Parameter
In modern machine learning practice, especially in the context of probabilistic decoding and large language models, the same family of transformations is usually parameterized by a temperature $T > 0$ rather than by the inverse temperature $\beta$.
The two parameterizations are equivalent:
$$T = \frac{1}{\beta}.$$
Therefore the softmax may be written as
$$a_i^{(T)} = \frac{e^{z_i / T}}{\sum_{k=1}^{K} e^{z_k / T}}.$$
This notation is often more intuitive in applications:
- as $T \to 0^+$, the distribution becomes increasingly concentrated on the largest logit and approaches a hardmax distribution when the maximizer is unique;
- as $T \to \infty$, all logits are flattened relative to one another and the distribution approaches the uniform distribution over the $K$ classes.
Thus, low temperature makes the output more peaked and more deterministic, whereas high temperature makes it flatter and more random. In modern large language models, temperature is exactly the parameter used at sampling time to control how concentrated or exploratory the next-token distribution should be.
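The effect of temperature can be sketched in a few lines of plain Python. The function name `softmax_with_temperature`, the logit values, and the list of temperatures below are illustrative assumptions, not part of any particular library or of the original text.

```python
import math


def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T; T must be strictly positive."""
    if T <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [v / T for v in z]
    m = max(scaled)  # shift by the max so that no exponent is positive
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, chosen only for illustration.
z = [2.0, 1.0, 0.1]

for T in (0.1, 0.5, 1.0, 2.0, 10.0):
    probs = softmax_with_temperature(z, T)
    print(f"T = {T:>4}: {[round(p, 3) for p in probs]}")
```

As `T` decreases, the largest entry should approach 1; as `T` increases, the entries should approach 1/3 each. At `T = 1.0` the function reduces to the standard softmax.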
The same idea can be illustrated numerically. Consider the logits
Then:
- a hardmax produces the one-hot vector ;
- a naive normalization of the positive scores would produce approximately ;
- the softmax output is approximately .
Softmax is therefore much closer to a hard winner-take-most rule than plain normalization, while remaining fully differentiable.
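The three rules compared above can be reproduced with a short sketch. The logit vector used here is hypothetical and is not the one from the original example; the helper names `hardmax`, `naive_normalize`, and `softmax` are also assumptions made for illustration.

```python
import math


def hardmax(z):
    """One-hot vector placed at the first largest logit."""
    winner = z.index(max(z))
    return [1.0 if i == winner else 0.0 for i in range(len(z))]


def naive_normalize(z):
    """Divide each score by the total; only meaningful for positive scores."""
    total = sum(z)
    return [v / total for v in z]


def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical positive scores (NOT the values used in the text).
z = [1.0, 2.0, 5.0]

print(hardmax(z))                          # [0.0, 0.0, 1.0]
print(naive_normalize(z))                  # [0.125, 0.25, 0.625]
print([round(p, 3) for p in softmax(z)])   # [0.017, 0.047, 0.936]
```

For these assumed scores, plain normalization leaves the winner with about 62% of the mass, while softmax gives it about 94%, which is the winner-take-most behavior described above.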
Non-Linearity
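Softmax is not a linear map. In general, for logit vectors $z, w \in \mathbb{R}^K$ and scalars $\lambda, \mu$,
$$\operatorname{softmax}(\lambda z + \mu w) \neq \lambda \operatorname{softmax}(z) + \mu \operatorname{softmax}(w).$$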
The reason is structural: every component contains an exponential in the numerator and a shared sum of exponentials in the denominator. Both operations destroy linearity.
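A simple way to see the failure of homogeneity is to scale the logits. If
$$z' = 2z,$$
then
$$\operatorname{softmax}(z')_i = \frac{e^{2 z_i}}{\sum_{k} e^{2 z_k}}, \qquad \text{so that} \qquad \operatorname{softmax}(2z) \neq 2 \operatorname{softmax}(z).$$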
Doubling the logits does not double the probabilities. It sharpens the distribution by increasing the relative separation between the logits. Probability mass moves toward the largest component.
This is the same phenomenon described in the sharpening-parameter view above: scaling all logits by a positive factor does not preserve the distribution unless all logits are equal. It changes how concentrated the output is.
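A quick numerical sketch of this failure of homogeneity, using hypothetical logits chosen for illustration (the helper name `softmax` is an assumption):

```python
import math


def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, chosen only for illustration.
z = [2.0, 1.0, 0.0]

p = softmax(z)
p_doubled = softmax([2.0 * v for v in z])   # softmax(2z)
twice_p = [2.0 * v for v in p]              # 2 * softmax(z), sums to 2

print("softmax(z)    :", [round(v, 3) for v in p])
print("softmax(2z)   :", [round(v, 3) for v in p_doubled])
print("2 * softmax(z):", [round(v, 3) for v in twice_p])

# With equal logits, scaling leaves the uniform output unchanged.
flat = softmax([3.0, 3.0, 3.0])
flat_doubled = softmax([6.0, 6.0, 6.0])
print(flat == flat_doubled)
```

The doubled logits still yield a probability vector, but one that puts more mass on the largest component, whereas `2 * softmax(z)` is not a probability vector at all.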
Permutation Equivariance
The softmax formula treats the coordinates of the logit vector symmetrically. No class index has a privileged role in the definition: each numerator is the exponential of one logit, and the denominator is the sum of the same exponentials over all coordinates.
Let $P$ be a $K \times K$ permutation matrix. Then
$$\operatorname{softmax}(Pz) = P \operatorname{softmax}(z).$$
Permuting the logits only relabels the exponentials in the numerator, while the denominator remains the same sum, merely written in a different order. Consequently, if the logits are permuted, the output probabilities are permuted in exactly the same way.
Proof
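Let $\pi$ be the permutation associated with $P$ under the convention
$$(Pz)_i = z_{\pi(i)}.$$
This means that the $i$-th coordinate of the permuted vector $Pz$ is obtained by taking the $\pi(i)$-th coordinate of the original vector $z$. Equivalently, $P$ reorders the coordinates of $z$ according to $\pi$.
With this convention, the $i$-th component of $\operatorname{softmax}(Pz)$ is
$$\operatorname{softmax}(Pz)_i = \frac{e^{(Pz)_i}}{\sum_{k=1}^{K} e^{(Pz)_k}} = \frac{e^{z_{\pi(i)}}}{\sum_{k=1}^{K} e^{z_{\pi(k)}}}.$$
Since $\pi$ is a permutation, the denominator is the same sum with reordered terms:
$$\sum_{k=1}^{K} e^{z_{\pi(k)}} = \sum_{k=1}^{K} e^{z_k}.$$
Therefore
$$\operatorname{softmax}(Pz)_i = \frac{e^{z_{\pi(i)}}}{\sum_{k=1}^{K} e^{z_k}} = \operatorname{softmax}(z)_{\pi(i)} = \bigl(P \operatorname{softmax}(z)\bigr)_i.$$
Since the equality holds for every coordinate $i$,
$$\operatorname{softmax}(Pz) = P \operatorname{softmax}(z).$$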
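For example, if $P$ swaps the first two coordinates of $z = (z_1, z_2, z_3)$, so that
$$Pz = (z_2, z_1, z_3),$$
then, writing $a = \operatorname{softmax}(z)$,
$$\operatorname{softmax}(Pz) = (a_2, a_1, a_3) = P \operatorname{softmax}(z).$$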
The distinction from invariance matters. Softmax does not collapse the vector into a scalar quantity independent of coordinate order; it returns another vector whose coordinates still correspond to classes. A permutation of the input therefore induces the same permutation of the output.
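Equivariance is easy to test numerically. The sketch below represents a permutation as an index list `pi` under the convention that entry `i` of the permuted vector is entry `pi[i]` of the original; the logits, the permutation, and the helper names `softmax` and `permute` are all hypothetical choices for illustration.

```python
import math


def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def permute(v, pi):
    """Return the vector whose i-th entry is v[pi[i]]."""
    return [v[pi[i]] for i in range(len(v))]


# Hypothetical logits and permutation, chosen only for illustration.
z = [0.3, -1.2, 2.0, 0.7]
pi = [2, 0, 3, 1]

left = softmax(permute(z, pi))    # permute first, then apply softmax
right = permute(softmax(z), pi)   # apply softmax first, then permute

equivariant = all(math.isclose(l, r, rel_tol=1e-12) for l, r in zip(left, right))
print(equivariant)  # True
```

Comparing with a tolerance rather than exact equality allows for the different summation order in the denominator.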
Link with entropy of a random variable
This is closely related to the symmetry of entropy, but the type of symmetry is different. Using the notation introduced in Entropy: Definition and Properties, let $X$ be a discrete random variable defined over the alphabet $\mathcal{X}$ with probability mass function $p(x)$. Its entropy is
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
Entropy depends on the probability masses, not on the order in which the elements of the alphabet are listed. If the probabilities are collected into a vector
$$p = (p_1, \dots, p_n), \qquad n = |\mathcal{X}|,$$
then reordering the alphabet corresponds to replacing $p$ by $Pp$ for some permutation matrix $P$. The entropy is unchanged because the defining sum only reorders its terms:
$$H(Pp) = -\sum_{i=1}^{n} (Pp)_i \log (Pp)_i = -\sum_{i=1}^{n} p_i \log p_i = H(p).$$
Entropy is therefore permutation invariant as a scalar functional of a distribution. Softmax is permutation equivariant because it maps one indexed vector to another indexed vector:
$$\operatorname{softmax}(Pz) = P \operatorname{softmax}(z).$$
Equivariance, not invariance
A permutation-invariant map $f$ would satisfy $f(Pz) = f(z)$. Softmax does not satisfy this in general. It satisfies a different, coordinate-respecting statement: $\operatorname{softmax}(Pz) = P \operatorname{softmax}(z)$.
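The difference between the two symmetries can be illustrated by composing them: the entropy of the softmax output does not change when the logits are permuted, even though the output vector itself does. This is a sketch with hypothetical logits and a hypothetical permutation; the helper names and the use of natural logarithms are assumptions.

```python
import math


def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def permute(v, pi):
    """Return the vector whose i-th entry is v[pi[i]]."""
    return [v[pi[i]] for i in range(len(v))]


# Hypothetical logits and permutation, chosen only for illustration.
z = [0.3, -1.2, 2.0, 0.7]
pi = [2, 0, 3, 1]

a = softmax(z)
a_perm = softmax(permute(z, pi))

same_entropy = math.isclose(entropy(a), entropy(a_perm), rel_tol=1e-12)
same_vector = a == a_perm

print(same_entropy)  # True: the scalar entropy is invariant
print(same_vector)   # False: the probability vector is reordered
```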
Shift Invariance and Numerical Stability
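Softmax is invariant under the addition of the same constant to all logits:
$$\operatorname{softmax}(z + c\mathbf{1}) = \operatorname{softmax}(z) \qquad \text{for every } c \in \mathbb{R},$$
where $\mathbf{1}$ denotes the all-ones vector. Indeed,
$$\frac{e^{z_i + c}}{\sum_{k} e^{z_k + c}} = \frac{e^{c}\, e^{z_i}}{e^{c} \sum_{k} e^{z_k}} = \frac{e^{z_i}}{\sum_{k} e^{z_k}}.$$
This property is mathematically important and computationally useful. In practice, one usually subtracts $m = \max_k z_k$ before exponentiating:
$$a_i = \frac{e^{z_i - m}}{\sum_{k} e^{z_k - m}}.$$
The result is unchanged, but numerical overflow is avoided. This is one of the reasons why modern implementations compute softmax and cross-entropy from logits in carefully stabilized form.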
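The max-subtraction trick can be sketched as follows. The function names `naive_softmax` and `stable_softmax` and the deliberately large logits are assumptions chosen to trigger overflow in double precision; this is an illustration, not the implementation used by any specific framework.

```python
import math


def naive_softmax(z):
    """Direct transcription of the formula; overflows for large logits."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(z):
    """Subtract the largest logit first, so every exponent is <= 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, large enough that exp() exceeds the double range.
z = [1000.0, 1001.0, 1002.0]

try:
    naive_softmax(z)
    naive_failed = False
except OverflowError:
    naive_failed = True

print(naive_failed)                               # True
print([round(p, 3) for p in stable_softmax(z)])   # [0.09, 0.245, 0.665]
```

By shift invariance the stabilized result equals the softmax of `[0.0, 1.0, 2.0]`, which is why the shifted computation loses nothing.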
Derivative of Softmax and Jacobian Structure
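Since softmax maps $\mathbb{R}^K$ to $\mathbb{R}^K$, its derivative is a $K \times K$ Jacobian matrix:
$$J = \frac{\partial a}{\partial z}, \qquad J_{ij} = \frac{\partial a_i}{\partial z_j}.$$
To compute a generic entry, write
$$a_i = \frac{e^{z_i}}{S}, \qquad S = \sum_{k=1}^{K} e^{z_k}.$$
Then
$$\frac{\partial e^{z_i}}{\partial z_j} = \delta_{ij}\, e^{z_i},$$
while
$$\frac{\partial S}{\partial z_j} = e^{z_j}.$$
Applying the quotient rule gives
$$\frac{\partial a_i}{\partial z_j} = \frac{\delta_{ij}\, e^{z_i}\, S - e^{z_i}\, e^{z_j}}{S^2} = \delta_{ij}\, a_i - a_i a_j.$$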
Diagonal Entries
If $i = j$, then
$$\frac{\partial a_i}{\partial z_i} = a_i - a_i^2 = a_i(1 - a_i).$$
Off-Diagonal Entries
If $i \neq j$, then
$$\frac{\partial a_i}{\partial z_j} = -a_i a_j.$$
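Combining the two cases yields the compact formula
$$\frac{\partial a_i}{\partial z_j} = a_i(\delta_{ij} - a_j).$$
Equivalently, the same derivative may be written in piecewise form as
$$\frac{\partial a_i}{\partial z_j} = \begin{cases} a_i(1 - a_i), & i = j, \\ -a_i a_j, & i \neq j, \end{cases}$$
where the Kronecker delta is defined by
$$\delta_{ij} = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}$$
This is exactly the derivative used in Softmax Loss to prove that, under softmax plus negative log-likelihood,
$$\frac{\partial \mathcal{L}}{\partial z_j} = a_j - y_j,$$
where $\mathcal{L}$ denotes the loss and $y$ the one-hot target vector.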
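The closed-form derivative can be compared against central finite differences as a sanity check. In the sketch below, the logits, the step size `h`, and the helper names `softmax`, `analytic_jacobian`, and `numerical_jacobian` are hypothetical choices made for illustration.

```python
import math


def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def analytic_jacobian(z):
    """Entry (i, j) is a_i * (delta_ij - a_j)."""
    a = softmax(z)
    K = len(a)
    return [[a[i] * ((1.0 if i == j else 0.0) - a[j]) for j in range(K)]
            for i in range(K)]


def numerical_jacobian(z, h=1e-6):
    """Central finite differences, one logit (column) at a time."""
    K = len(z)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        plus, minus = list(z), list(z)
        plus[j] += h
        minus[j] -= h
        a_plus, a_minus = softmax(plus), softmax(minus)
        for i in range(K):
            J[i][j] = (a_plus[i] - a_minus[i]) / (2.0 * h)
    return J


# Hypothetical logits, chosen only for illustration.
z = [0.5, -1.0, 2.0]
Ja = analytic_jacobian(z)
Jn = numerical_jacobian(z)

max_error = max(abs(Ja[i][j] - Jn[i][j]) for i in range(3) for j in range(3))
print(f"largest discrepancy: {max_error:.2e}")
```

The discrepancy should be at the level of finite-difference and rounding error, far below the size of the Jacobian entries themselves.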
Matrix Form of the Jacobian
If $a$ is viewed as a column vector, the Jacobian can be written compactly as
$$J = \operatorname{diag}(a) - a a^{\top}.$$
Several important consequences follow immediately:
- the diagonal entries are positive;
- the off-diagonal entries are strictly negative, since every output is strictly positive;
- each row sum is zero;
- each column sum is zero;
- the Jacobian is singular.
These facts encode the two defining structural constraints of softmax:
- the outputs must always sum to $1$;
- adding the same constant to all logits leaves the output unchanged.
Equivalently, softmax depends only on relative logits, not on their common offset.
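These structural properties can be verified directly on a small example. The sketch below builds the matrix as the diagonal matrix of the outputs minus their outer product, using hypothetical logits and a hypothetical tolerance of `1e-12`; the helper name `softmax` is also an assumption.

```python
import math


def softmax(z):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, chosen only for illustration.
z = [0.5, -1.0, 2.0, 0.0]
a = softmax(z)
K = len(a)

# Jacobian: diagonal matrix of the outputs minus their outer product.
J = [[(a[i] if i == j else 0.0) - a[i] * a[j] for j in range(K)]
     for i in range(K)]

row_sums = [sum(J[i]) for i in range(K)]
col_sums = [sum(J[i][j] for i in range(K)) for j in range(K)]

rows_vanish = all(abs(s) < 1e-12 for s in row_sums)
cols_vanish = all(abs(s) < 1e-12 for s in col_sums)
diag_positive = all(J[i][i] > 0 for i in range(K))
offdiag_negative = all(J[i][j] < 0 for i in range(K) for j in range(K) if i != j)

print(rows_vanish, cols_vanish, diag_positive, offdiag_negative)
```

Vanishing row sums mean that the Jacobian sends the all-ones vector to zero, which is the shift invariance seen from the derivative side and the reason the matrix is singular. Vanishing column sums express that the outputs keep summing to one under any perturbation of the logits.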
Softmax in One Sentence
Softmax is a differentiable, non-local normalization map that transforms logits into a categorical probability distribution, induces competition between classes, and has Jacobian $J = \operatorname{diag}(a) - a a^{\top}$.