Consider an output layer with $K$ classes. Let

$$z = (z_1, \dots, z_K) \in \mathbb{R}^K$$

denote the vector of logits, and let

$$a = \operatorname{softmax}(z)$$

denote the output activations, whose components are

$$a_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}, \qquad i = 1, \dots, K.$$

As explained in Softmax Loss, the vector $a$ is a probability distribution: each component is strictly positive and

$$\sum_{i=1}^{K} a_i = 1.$$
Monotonicity and Competition Between Classes
One of the most important qualitative properties of softmax is the following:
- increasing the logit of one class increases its own output probability;
- the same perturbation decreases the probabilities assigned to all the other classes.
This statement can be made precise by studying the partial derivatives $\partial a_i / \partial z_j$.
Explicit Derivative Formulas
The derivative of the $i$-th softmax output with respect to the $j$-th logit is

$$\frac{\partial a_i}{\partial z_j} = a_i\,(\delta_{ij} - a_j),$$

where $\delta_{ij}$ is the Kronecker delta. Therefore:

- for $i = j$,
$$\frac{\partial a_i}{\partial z_i} = a_i(1 - a_i) > 0;$$
- for $i \neq j$,
$$\frac{\partial a_i}{\partial z_j} = -a_i a_j < 0.$$
These two formulas establish the monotonicity structure of the softmax layer:
- increasing $z_i$ increases $a_i$;
- increasing $z_j$ decreases every $a_i$ with $i \neq j$.
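This sign structure can be checked numerically with finite differences. A minimal plain-Python sketch (the `softmax` helper and the sample logits are illustrative, not from the text):

```python
import math

def softmax(z):
    # Exponentiate and normalize; shifting by the max keeps exp well-behaved.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Perturb one logit and watch every component move.
z = [1.0, 0.5, -0.2]
a = softmax(z)
eps = 1e-6
z_pert = [z[0] + eps, z[1], z[2]]   # bump the logit of class 0
a_pert = softmax(z_pert)

assert a_pert[0] > a[0]             # own probability increases
assert a_pert[1] < a[1]             # every other probability decreases
assert a_pert[2] < a[2]
```

Only one logit was touched, yet all three outputs changed, in exactly the directions the derivative formulas predict.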
Intuition
The numerator $e^{z_i}$ tries to increase the weight assigned to class $i$, but the denominator $\sum_k e^{z_k}$ is shared by all classes. Since the total probability mass must remain equal to $1$, any gain by one class must be compensated by losses elsewhere.
Important
The softmax layer therefore induces a genuine competition between classes. This is precisely the behavior needed in mutually exclusive multiclass classification: the classes do not make independent yes/no decisions, but compete for a single unit of probability mass.
Non-Locality of the Softmax Layer
For a sigmoid output unit, the activation is local:

$$a_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}.$$

Each output depends only on its own pre-activation.

Softmax behaves differently. In an output layer,

$$a_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}},$$

so each output depends not only on $z_i$, but on all logits $z_1, \dots, z_K$ through the common denominator.
This is why the softmax layer is non-local:
- changing $z_j$ affects $a_j$ directly;
- the same change also affects all other outputs $a_i$, $i \neq j$, indirectly.
The non-zero off-diagonal derivatives

$$\frac{\partial a_i}{\partial z_j} = -a_i a_j \neq 0, \qquad i \neq j,$$

are the differential signature of this non-local coupling.
Local Versus Non-Local Output Layers
- A sigmoid output is local: each activation depends only on its own logit.
- A softmax output is non-local: every activation depends on all logits in the layer.
- This non-locality is exactly what allows the output to be interpreted as a categorical probability distribution.
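The local/non-local contrast can be made concrete by comparing the two activations on logit vectors that differ in a single coordinate. A small plain-Python sketch (helper names and logits are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z1 = [0.5, 1.5, -1.0]
z2 = [0.5, 1.5, 2.0]    # only the last logit differs

# Sigmoid is local: units whose logit is untouched keep their activation.
assert sigmoid(z1[0]) == sigmoid(z2[0])
assert sigmoid(z1[1]) == sigmoid(z2[1])

# Softmax is non-local: every output shifts through the shared denominator.
a1, a2 = softmax(z1), softmax(z2)
assert all(p != q for p, q in zip(a1, a2))
```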
Why the name “Soft-Max”
The origin of the name becomes transparent by introducing a sharpening parameter $c > 0$:

$$a_i(c) = \frac{e^{c z_i}}{\sum_{k=1}^{K} e^{c z_k}}.$$

The standard softmax corresponds to $c = 1$.
Why “Soft-Max”?
- For every $c > 0$, the vector $a(c)$ is still a probability distribution, because
$$\sum_{i=1}^{K} a_i(c) = 1.$$
- Let $M = \{\, i : z_i = \max_k z_k \,\}$ be the set of maximizers. Then
$$\lim_{c \to \infty} a_i(c) = \begin{cases} 1/|M| & \text{if } i \in M, \\ 0 & \text{otherwise.} \end{cases}$$
- In particular, if the maximum is unique, say at class $i^{*}$, then
$$\lim_{c \to \infty} a(c) = e_{i^{*}}.$$
In that case the limit is the one-hot output associated with a hardmax selector.
This explains the name: softmax is a differentiable, softened version of a hardmax or argmax-like selection mechanism. At finite $c$, probability mass is distributed across all classes; as $c$ grows, the distribution becomes increasingly concentrated on the largest logit.
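The limiting behavior can be observed directly. A minimal plain-Python sketch (the helper `softmax_sharpened` and the sample logits are illustrative):

```python
import math

def softmax_sharpened(z, c):
    # Softmax applied to c*z; large c pushes the mass onto the argmax.
    m = max(z)
    exps = [math.exp(c * (v - m)) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, 0.5]

# For every c > 0 the output is still a probability distribution.
for c in [0.5, 1.0, 5.0, 50.0]:
    a = softmax_sharpened(z, c)
    assert abs(sum(a) - 1.0) < 1e-12

# At c = 50 the distribution is essentially one-hot on the largest logit.
assert softmax_sharpened(z, 50.0)[0] > 0.999
```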
Equivalent Temperature Parameter
In modern machine learning practice, especially in the context of probabilistic decoding and large language models, the same family of transformations is usually parameterized by a temperature $T > 0$ rather than by a sharpening parameter $c$.
The two parameterizations are equivalent:

$$T = \frac{1}{c}.$$

Therefore the softmax may be written as

$$a_i(T) = \frac{e^{z_i / T}}{\sum_{k=1}^{K} e^{z_k / T}}.$$
This notation is often more intuitive in applications:
- as $T \to 0^{+}$, the distribution becomes increasingly concentrated on the largest logit and approaches a hardmax distribution when the maximizer is unique;
- as $T \to \infty$, all logits are flattened relative to one another and the distribution approaches the uniform distribution over the $K$ classes.
Thus, low temperature makes the output more peaked and more deterministic, whereas high temperature makes it flatter and more random. In modern large language models, temperature is exactly the parameter used at sampling time to control how concentrated or exploratory the next-token distribution should be.
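Both temperature extremes can be checked numerically. A minimal plain-Python sketch (the helper `softmax_T` and the sample logits are illustrative):

```python
import math

def softmax_T(z, T):
    # Temperature-scaled softmax: divide logits by T before normalizing.
    m = max(z)
    exps = [math.exp((v - m) / T) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, 0.5]

low  = softmax_T(z, 0.05)    # low temperature: near-deterministic
high = softmax_T(z, 100.0)   # high temperature: near-uniform

assert low[0] > 0.999                             # almost all mass on argmax
assert all(abs(p - 1/3) < 0.01 for p in high)     # close to uniform
```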
The same idea can be illustrated numerically. Consider the logits

$$z = (3.0,\; 1.0,\; 0.2).$$

Then:

- a hardmax produces the one-hot vector $(1,\; 0,\; 0)$;
- a naive normalization of the positive scores, $z_i / \sum_k z_k$, would produce approximately $(0.714,\; 0.238,\; 0.048)$;
- the softmax output is approximately $(0.836,\; 0.113,\; 0.051)$.
Softmax is therefore much closer to a hard winner-take-most rule than plain normalization, while remaining fully differentiable.
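A plain-Python sketch of this comparison (the sample logits are illustrative):

```python
import math

z = [3.0, 1.0, 0.2]

# Hardmax: all probability mass on the argmax.
hard = [1.0 if v == max(z) else 0.0 for v in z]

# Naive normalization of the positive raw scores.
naive = [v / sum(z) for v in z]

# Softmax of the same scores.
exps = [math.exp(v) for v in z]
soft = [e / sum(exps) for e in exps]

assert hard == [1.0, 0.0, 0.0]
assert soft[0] > naive[0]            # softmax is closer to winner-take-most
assert abs(soft[0] - 0.836) < 0.001
```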
Shift Invariance and Numerical Stability
Softmax is invariant under the addition of the same constant to all logits:

$$\operatorname{softmax}(z + c\,\mathbf{1}) = \operatorname{softmax}(z) \qquad \text{for all } c \in \mathbb{R},$$

where $\mathbf{1} = (1, \dots, 1)$ denotes the all-ones vector. Indeed,

$$\frac{e^{z_i + c}}{\sum_k e^{z_k + c}} = \frac{e^{c}\, e^{z_i}}{e^{c} \sum_k e^{z_k}} = \frac{e^{z_i}}{\sum_k e^{z_k}}.$$

This property is mathematically important and computationally useful. In practice, one usually subtracts $\max_k z_k$ before exponentiating:

$$a_i = \frac{e^{z_i - \max_k z_k}}{\sum_j e^{z_j - \max_k z_k}}.$$
The result is unchanged, but numerical overflow is avoided. This is one of the reasons why modern implementations compute softmax and cross-entropy from logits in carefully stabilized form.
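The difference between the naive and the stabilized formula is easy to demonstrate: for large logits, CPython's `math.exp` overflows a double and raises `OverflowError`, while the max-shifted version is safe. A minimal sketch (the helper name and the sample logits are illustrative):

```python
import math

def softmax_stable(z):
    # Subtract the max logit before exponentiating; by shift invariance the
    # result is unchanged, but exp never sees a large positive argument.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [1000.0, 999.0, 998.0]

# The naive formula overflows: math.exp(1000.0) exceeds double precision.
try:
    naive = [math.exp(v) for v in z]
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed

# The stabilized version quietly computes softmax of [0, -1, -2].
a = softmax_stable(z)
assert abs(sum(a) - 1.0) < 1e-12
assert a[0] > a[1] > a[2]
```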
Derivative of Softmax and Jacobian Structure
Since softmax maps $\mathbb{R}^K$ to $\mathbb{R}^K$, its derivative is a Jacobian matrix:

$$J \in \mathbb{R}^{K \times K}, \qquad J_{ij} = \frac{\partial a_i}{\partial z_j}.$$

To compute a generic entry, write

$$a_i = \frac{e^{z_i}}{S}, \qquad S = \sum_{k=1}^{K} e^{z_k}.$$

Then

$$\frac{\partial}{\partial z_j}\, e^{z_i} = \delta_{ij}\, e^{z_i},$$

while

$$\frac{\partial S}{\partial z_j} = e^{z_j}.$$

Applying the quotient rule gives

$$\frac{\partial a_i}{\partial z_j} = \frac{\delta_{ij}\, e^{z_i}\, S - e^{z_i}\, e^{z_j}}{S^2} = \delta_{ij}\, a_i - a_i a_j.$$
Diagonal Entries
If $i = j$, then

$$\frac{\partial a_i}{\partial z_i} = a_i - a_i^2 = a_i(1 - a_i).$$
Off-Diagonal Entries
If $i \neq j$, then

$$\frac{\partial a_i}{\partial z_j} = -a_i a_j.$$
Combining the two cases yields the compact formula

$$\frac{\partial a_i}{\partial z_j} = a_i\,(\delta_{ij} - a_j).$$

Equivalently, the same derivative may be written in piecewise form as

$$\frac{\partial a_i}{\partial z_j} = \begin{cases} a_i(1 - a_i) & \text{if } i = j, \\ -a_i a_j & \text{if } i \neq j, \end{cases}$$

where the Kronecker delta is defined by

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$
This is exactly the derivative used in Softmax Loss to prove that, under softmax plus negative log-likelihood,

$$\frac{\partial \mathcal{L}}{\partial z_j} = a_j - y_j,$$

where $y$ denotes the one-hot target vector.
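This remarkably simple gradient can be verified against finite differences. A minimal plain-Python sketch (the helpers `softmax` and `nll` and the sample logits are illustrative):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def nll(z, target):
    # Negative log-likelihood of the target class under softmax(z).
    return -math.log(softmax(z)[target])

z, target = [0.3, -1.2, 0.8], 0
a = softmax(z)
y = [1.0 if i == target else 0.0 for i in range(len(z))]

# Central finite differences should match a - y componentwise.
eps = 1e-6
for j in range(len(z)):
    zp = list(z); zp[j] += eps
    zm = list(z); zm[j] -= eps
    grad_fd = (nll(zp, target) - nll(zm, target)) / (2 * eps)
    assert abs(grad_fd - (a[j] - y[j])) < 1e-6
```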
Matrix Form of the Jacobian
If $a$ is viewed as a column vector, the Jacobian can be written compactly as

$$J = \operatorname{diag}(a) - a\,a^{\top}.$$
Several important consequences follow immediately:
- the diagonal entries $a_i(1 - a_i)$ are positive;
- the off-diagonal entries $-a_i a_j$ are negative;
- each row sum is zero: $\sum_j J_{ij} = a_i - a_i \sum_j a_j = 0$;
- each column sum is zero: $\sum_i J_{ij} = a_j - a_j = 0$;
- the Jacobian is singular, since $J\mathbf{1} = 0$.
These facts encode the two defining structural constraints of softmax:
- the outputs must always sum to $1$;
- adding the same constant to all logits leaves the output unchanged.
Equivalently, softmax depends only on relative logits, not on their common offset.
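All of these structural facts can be verified by building the Jacobian entrywise. A minimal plain-Python sketch (the helper `softmax` and the sample logits are illustrative):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

a = softmax([0.5, -0.3, 1.2])
K = len(a)

# J = diag(a) - a a^T, built entry by entry.
J = [[(a[i] if i == j else 0.0) - a[i] * a[j] for j in range(K)]
     for i in range(K)]

for i in range(K):
    assert J[i][i] > 0                                    # positive diagonal
    assert all(J[i][j] < 0 for j in range(K) if j != i)   # negative off-diagonal
    assert abs(sum(J[i])) < 1e-12                         # zero row sums
for j in range(K):
    assert abs(sum(J[i][j] for i in range(K))) < 1e-12    # zero column sums

# Singularity: J annihilates the all-ones direction, i.e. J @ 1 = 0.
ones_image = [sum(J[i][j] for j in range(K)) for i in range(K)]
assert all(abs(v) < 1e-12 for v in ones_image)
```

The zero row sums encode conservation of total probability; the zero image of the all-ones vector encodes shift invariance.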
Softmax in One Sentence
Softmax is a differentiable, non-local normalization map that transforms logits into a categorical probability distribution, induces competition between classes, and has Jacobian

$$J = \operatorname{diag}(a) - a\,a^{\top}, \qquad J_{ij} = a_i\,(\delta_{ij} - a_j).$$