Consider an output layer with $K$ classes. Let

$$z = (z_1, \dots, z_K)$$

denote the vector of logits, and let

$$a = \operatorname{softmax}(z)$$

denote the output activations, whose components are

$$a_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}, \qquad i = 1, \dots, K.$$

As explained in Softmax Loss, the vector $a$ is a probability distribution: each component is strictly positive and

$$\sum_{i=1}^{K} a_i = 1.$$
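The definition above can be sketched in a few lines of plain Python (the helper name `softmax` is ours, not from any particular library):

```python
import math

def softmax(z):
    """Map a list of logits z to the probability vector a (illustrative helper)."""
    exps = [math.exp(zi) for zi in z]   # numerators e^{z_i}
    total = sum(exps)                   # shared denominator sum_k e^{z_k}
    return [e / total for e in exps]

a = softmax([1.0, 2.0, 3.0])
print(a)          # each component strictly positive
print(sum(a))     # components sum to 1
```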
Monotonicity and Competition Between Classes

One of the most important qualitative properties of softmax is the following:

  • increasing the logit of one class increases its own output probability;
  • the same perturbation decreases the probabilities assigned to all the other classes.

This statement can be made precise by studying the partial derivatives $\partial a_i / \partial z_j$.

Explicit Derivative Formulas

The derivative of the $i$-th softmax output with respect to the $j$-th logit is

$$\frac{\partial a_i}{\partial z_j} = a_i \left( \delta_{ij} - a_j \right),$$

where $\delta_{ij}$ is the Kronecker delta. Therefore:

  • for $i = j$, $\dfrac{\partial a_i}{\partial z_i} = a_i (1 - a_i) > 0$;
  • for $i \neq j$, $\dfrac{\partial a_i}{\partial z_j} = -\,a_i a_j < 0$.

These two formulas establish the monotonicity structure of the softmax layer:

  • increasing $z_i$ increases $a_i$;
  • increasing $z_i$ decreases every $a_j$ with $j \neq i$.
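Both sign claims can be checked with a finite-difference sketch in plain Python (the `softmax` helper below is our own):

```python
import math

def softmax(z):
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.5, 1.0, -0.3]
eps = 1e-6

base = softmax(z)
bumped = softmax([z[0] + eps] + z[1:])   # increase only the first logit

# Finite-difference approximations of the derivatives da_i/dz_0:
d = [(b - a) / eps for a, b in zip(base, bumped)]
print(d[0])        # positive, close to a_0(1 - a_0)
print(d[1], d[2])  # negative, close to -a_0 a_1 and -a_0 a_2
```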

Intuition

The numerator $e^{z_i}$ tries to increase the weight assigned to class $i$, but the denominator $\sum_k e^{z_k}$ is shared by all classes. Since the total probability mass must remain equal to $1$, any gain by one class must be compensated by losses elsewhere.

Important

The softmax layer therefore induces a genuine competition between classes. This is precisely the behavior needed in mutually exclusive multiclass classification: the classes do not make independent yes/no decisions, but compete for a single unit of probability mass.

Non-Locality of the Softmax Layer

For a sigmoid output unit, the activation is local:

$$a_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}.$$

Each output depends only on its own pre-activation $z_i$.

Softmax behaves differently. In an output layer,

$$a_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}},$$

so each output depends not only on $z_i$, but on all logits $z_1, \dots, z_K$ through the common denominator.

This is why the softmax layer is non-local:

  • changing $z_j$ affects $a_j$ directly;
  • the same change also affects all other outputs $a_i$, $i \neq j$, indirectly.

The non-zero off-diagonal derivatives

$$\frac{\partial a_i}{\partial z_j} = -\,a_i a_j \neq 0 \qquad (i \neq j)$$

are the differential signature of this non-local coupling.

Local Versus Non-Local Output Layers

  • A sigmoid output is local: each activation depends only on its own logit.
  • A softmax output is non-local: every activation depends on all logits in the layer.
  • This non-locality is exactly what allows the output to be interpreted as a categorical probability distribution.
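The contrast between the two output layers can be made concrete with a small sketch (the helpers `sigmoid` and `softmax` are our own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z_before = [1.0, 0.0, -1.0]
z_after = [1.0, 0.0, 5.0]   # only the third logit changed

# Sigmoid is local: the untouched units are unaffected.
print(sigmoid(z_before[0]) == sigmoid(z_after[0]))   # True

# Softmax is non-local: all outputs change through the shared denominator.
print(softmax(z_before))
print(softmax(z_after))
```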

Why the name “Soft-Max”

The origin of the name becomes transparent by introducing a sharpening parameter $\beta > 0$:

$$a_i(\beta) = \frac{e^{\beta z_i}}{\sum_{k=1}^{K} e^{\beta z_k}}.$$

The standard softmax corresponds to $\beta = 1$.

Why “Soft-Max”?

  1. For every $\beta > 0$, the vector $a(\beta)$ is still a probability distribution, because $a_i(\beta) > 0$ for all $i$ and $\sum_{i=1}^{K} a_i(\beta) = 1$.
  2. Let $M = \{\, i : z_i = \max_k z_k \,\}$ be the set of maximizers. Then
     $$\lim_{\beta \to \infty} a_i(\beta) = \begin{cases} 1/|M| & \text{if } i \in M, \\ 0 & \text{if } i \notin M. \end{cases}$$
  3. In particular, if the maximum is unique, say at class $i^{*}$, then $\lim_{\beta \to \infty} a_{i^{*}}(\beta) = 1$ and $\lim_{\beta \to \infty} a_i(\beta) = 0$ for $i \neq i^{*}$. In that case the limit is the one-hot output associated with a hardmax selector.

This explains the name: softmax is a differentiable, softened version of a hardmax or argmax-like selection mechanism. At finite $\beta$, probability mass is distributed across all classes; as $\beta$ grows, the distribution becomes increasingly concentrated on the largest logit.
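The concentration effect can be observed numerically with a sketch of the $\beta$-parameterized softmax (`softmax_beta` is a hypothetical helper):

```python
import math

def softmax_beta(z, beta):
    """Softmax with sharpening parameter beta; beta = 1 is the standard softmax."""
    exps = [math.exp(beta * zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 5.0]   # unique maximizer at index 2
for beta in (1.0, 5.0, 50.0):
    print(beta, softmax_beta(z, beta))
# As beta grows, the output approaches the one-hot vector (0, 0, 1).
```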

Equivalent Temperature Parameter

In modern machine learning practice, especially in the context of probabilistic decoding and large language models, the same family of transformations is usually parameterized by a temperature $T > 0$ rather than by a sharpening parameter $\beta$.

The two parameterizations are equivalent:

$$\beta = \frac{1}{T}.$$

Therefore the softmax may be written as

$$a_i(T) = \frac{e^{z_i / T}}{\sum_{k=1}^{K} e^{z_k / T}}.$$

This notation is often more intuitive in applications:

  • as $T \to 0^{+}$, the distribution becomes increasingly concentrated on the largest logit and approaches a hardmax distribution when the maximizer is unique;
  • as $T \to \infty$, all logits are flattened relative to one another and the distribution approaches the uniform distribution over the $K$ classes.

Thus, low temperature makes the output more peaked and more deterministic, whereas high temperature makes it flatter and more random. In modern large language models, temperature is exactly the parameter used at sampling time to control how concentrated or exploratory the next-token distribution should be.
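A short sketch in the temperature parameterization (the helper `softmax_temperature` is ours) makes both limits visible:

```python
import math

def softmax_temperature(z, T):
    """Softmax with temperature T; T = 1 recovers the standard softmax."""
    exps = [math.exp(zi / T) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 5.0]
print(softmax_temperature(z, 0.1))    # low T: nearly one-hot on the largest logit
print(softmax_temperature(z, 1.0))    # standard softmax
print(softmax_temperature(z, 100.0))  # high T: nearly uniform
```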

The same idea can be illustrated numerically. Consider, for instance, the logits

$$z = (1, 2, 5).$$

Then:

  • a hardmax produces the one-hot vector $(0, 0, 1)$;
  • a naive normalization of the positive scores would produce approximately $(0.125,\ 0.250,\ 0.625)$;
  • the softmax output is approximately $(0.017,\ 0.047,\ 0.936)$.

Softmax is therefore much closer to a hard winner-take-most rule than plain normalization, while remaining fully differentiable.

Shift Invariance and Numerical Stability

Softmax is invariant under the addition of the same constant to all logits:

$$\operatorname{softmax}(z + c\,\mathbf{1}) = \operatorname{softmax}(z) \qquad \text{for all } c \in \mathbb{R},$$

where $\mathbf{1}$ denotes the all-ones vector. Indeed,

$$\frac{e^{z_i + c}}{\sum_{k} e^{z_k + c}} = \frac{e^{c}\, e^{z_i}}{e^{c} \sum_{k} e^{z_k}} = \frac{e^{z_i}}{\sum_{k} e^{z_k}}.$$

This property is mathematically important and computationally useful. In practice, one usually subtracts $m = \max_k z_k$ before exponentiating:

$$a_i = \frac{e^{z_i - m}}{\sum_{k} e^{z_k - m}}.$$
The result is unchanged, but numerical overflow is avoided. This is one of the reasons why modern implementations compute softmax and cross-entropy from logits in carefully stabilized form.
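A stabilized implementation following this recipe might look like the sketch below (plain Python, not any particular library's API):

```python
import math

def softmax_stable(z):
    """Shift-invariant softmax: subtract max(z) so the largest exponent is 0."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) would overflow a float, but the shifted version is safe,
# and by shift invariance the result equals the softmax of the raw logits.
print(softmax_stable([1000.0, 999.0, 998.0]))
```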

Derivative of Softmax and Jacobian Structure

Since softmax maps $\mathbb{R}^{K}$ to $\mathbb{R}^{K}$, its derivative is a Jacobian matrix:

$$J_{ij} = \frac{\partial a_i}{\partial z_j}.$$

To compute a generic entry, write

$$a_i = \frac{e^{z_i}}{S}, \qquad S = \sum_{k=1}^{K} e^{z_k}.$$

Then

$$\frac{\partial}{\partial z_j}\, e^{z_i} = \delta_{ij}\, e^{z_i},$$

while

$$\frac{\partial S}{\partial z_j} = e^{z_j}.$$

Applying the quotient rule gives

$$\frac{\partial a_i}{\partial z_j} = \frac{\delta_{ij}\, e^{z_i}\, S - e^{z_i}\, e^{z_j}}{S^{2}} = a_i \left( \delta_{ij} - a_j \right).$$
Diagonal Entries

If $i = j$, then

$$\frac{\partial a_i}{\partial z_i} = a_i (1 - a_i).$$

Off-Diagonal Entries

If $i \neq j$, then

$$\frac{\partial a_i}{\partial z_j} = -\,a_i a_j.$$

Combining the two cases yields the compact formula

$$\frac{\partial a_i}{\partial z_j} = a_i \left( \delta_{ij} - a_j \right).$$

Equivalently, the same derivative may be written in piecewise form as

$$\frac{\partial a_i}{\partial z_j} = \begin{cases} a_i (1 - a_i) & \text{if } i = j, \\ -\,a_i a_j & \text{if } i \neq j, \end{cases}$$

where the Kronecker delta is defined by

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$

This is exactly the derivative used in Softmax Loss to prove that, under softmax plus negative log-likelihood,

$$\frac{\partial L}{\partial z_j} = a_j - y_j,$$

where $y$ is the one-hot target vector.
Matrix Form of the Jacobian

If $a$ is viewed as a column vector, the Jacobian can be written compactly as

$$J = \operatorname{diag}(a) - a\, a^{\top}.$$
Several important consequences follow immediately:

  • the diagonal entries $a_i (1 - a_i)$ are positive;
  • the off-diagonal entries $-\,a_i a_j$ are non-positive;
  • each row sum is zero;
  • each column sum is zero;
  • the Jacobian is singular.

These facts encode the two defining structural constraints of softmax:

  • the outputs must always sum to $1$;
  • adding the same constant to all logits leaves the output unchanged.

Equivalently, softmax depends only on relative logits, not on their common offset.
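These structural facts are easy to verify numerically by building $J = \operatorname{diag}(a) - a\, a^{\top}$ entrywise (a plain-Python sketch with our own `softmax` helper):

```python
import math

def softmax(z):
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

a = softmax([0.2, -1.0, 0.7])
K = len(a)

# J[i][j] = a_i (delta_ij - a_j), i.e. (diag(a) - a a^T)[i][j]
J = [[a[i] * ((1.0 if i == j else 0.0) - a[j]) for j in range(K)] for i in range(K)]

print([J[i][i] for i in range(K)])                         # diagonal: positive
print([sum(row) for row in J])                             # row sums: 0
print([sum(J[i][j] for i in range(K)) for j in range(K)])  # column sums: 0
```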

Softmax in One Sentence

Softmax is a differentiable, non-local normalization map that transforms logits into a categorical probability distribution, induces competition between classes, and has Jacobian $J = \operatorname{diag}(a) - a\, a^{\top}$.