The diagnosis in ReLU identified the dying-ReLU problem as the price paid for the sharp half-line on which the derivative is exactly zero. Once a neuron’s pre-activation falls into this region, no gradient flows, no update occurs, and the neuron is permanently silent.
The natural family of fixes preserves the linear, non-saturating positive half of ReLU and softens the negative half so that the gradient is no longer exactly zero. The resulting variants are collectively called the *LU family (Leaky ReLU, PReLU, ELU, SELU, GELU). Each one trades off computational simplicity, smoothness, output statistics, and inductive bias differently, and each has a domain in which it dominates.
The shared structural pattern
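Every member of the family has, at least approximately, the form

$$
f(x) = \begin{cases} x, & x \ge 0, \\ g(x), & x < 0, \end{cases}
$$

for some continuous function $g$ with $g(0) = 0$ that keeps $f'(x)$ nonzero (or at least nonzero in expectation) for $x < 0$. The right half is the identity (exactly for Leaky ReLU, PReLU, and ELU; up to a constant factor for SELU; only asymptotically for GELU); the left half is what differs.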
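The shared pattern can be sketched as a single helper parameterized by the negative branch. This is only an illustration: `lu_family` and the two lambdas are hypothetical names, and the slope `0.01` and the ELU scale of 1 are example values rather than anything prescribed here.

```python
import torch
import torch.nn.functional as F


def lu_family(x, g):
    """Identity for x >= 0 and g(x) for x < 0; g is assumed continuous with g(0) = 0."""
    return torch.where(x >= 0, x, g(x))


x = torch.linspace(-3.0, 3.0, 13)

# Two members of the family, obtained by swapping only the negative branch.
leaky = lu_family(x, lambda t: 0.01 * t)      # Leaky ReLU with slope 0.01
elu = lu_family(x, lambda t: torch.expm1(t))  # ELU with scale 1: e^t - 1

# Both agree with PyTorch's built-in versions on this grid.
print(torch.allclose(leaky, F.leaky_relu(x, negative_slope=0.01), atol=1e-6))
print(torch.allclose(elu, F.elu(x), atol=1e-6))
```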
Comparison table
| Activation | Definition for $x < 0$ | $f'(x)$ for $x < 0$ | Key property |
|---|---|---|---|
| Leaky ReLU | $\alpha x$, fixed $\alpha$ (e.g. $\alpha = 0.01$) | constant $\alpha$ | Cheap, prevents dead neurons, no exponentials. |
| PReLU | same form as Leaky ReLU, but $\alpha$ is learned | learnable parameter $\alpha$ | Adaptable slope per channel; small parameter cost. |
| ELU | $\alpha (e^{x} - 1)$ | $\alpha e^{x}$ | Smooth, saturates at $-\alpha$; output mean closer to zero. |
| SELU | $\lambda \alpha (e^{x} - 1)$ with specific $\lambda, \alpha$ | scaled exponential $\lambda \alpha e^{x}$ | Self-normalizing: preserves zero mean and unit variance across layers. |
| GELU | $x\,\Phi(x)$, $\Phi$ the Gaussian CDF | smooth, non-monotonic | Probabilistic gating, default in modern Transformers. |
Leaky ReLU
The minimal modification of ReLU: replace the flat zero on $x < 0$ with a line of small positive slope $\alpha$:

$$
\text{LeakyReLU}(x) = \begin{cases} x, & x \ge 0, \\ \alpha x, & x < 0. \end{cases}
$$

The typical choice is $\alpha = 0.01$; larger values are occasionally used. The derivative is never exactly zero, so the dying-ReLU pathology is eliminated by construction: a neuron in the negative region still has a tiny but nonzero gradient $\alpha$, and gradient descent can in principle pull it back into the active region.
The cost is essentially nothing: no exponentials, one multiplication and a sign check per forward evaluation. Leaky ReLU is therefore the safe default upgrade over ReLU when dead neurons are suspected, and is widely used in computer vision (especially in GANs, where dying ReLU is a recurrent training pathology in the generator).
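The difference in gradient flow can be checked directly with autograd. The sketch below uses three arbitrary inputs chosen for illustration and compares the gradient that reaches them through ReLU and through Leaky ReLU.

```python
import torch
import torch.nn as nn

# Two negative pre-activations and one positive one (illustrative values).
x = torch.tensor([-3.0, -0.5, 2.0], requires_grad=True)

relu_out = nn.ReLU()(x)
(relu_grad,) = torch.autograd.grad(relu_out.sum(), x)

leaky_out = nn.LeakyReLU(negative_slope=0.01)(x)
(leaky_grad,) = torch.autograd.grad(leaky_out.sum(), x)

print(relu_grad)   # zero on the negative inputs, one on the positive input
print(leaky_grad)  # negative_slope on the negative inputs, one on the positive input
```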
```python
import torch.nn as nn

act = nn.LeakyReLU(negative_slope=0.01)
```

PReLU (Parametric ReLU)
Introduced by He et al. (2015), PReLU generalizes Leaky ReLU by treating the negative slope not as a hyperparameter but as a learnable parameter $\alpha$:

$$
\text{PReLU}(x) = \begin{cases} x, & x \ge 0, \\ \alpha x, & x < 0. \end{cases}
$$

The parameter $\alpha$ is initialized to a small positive value (typically $0.25$, the value used by He et al. and the PyTorch default) and updated by backpropagation alongside the weights. The negative slope can be shared across all channels (one scalar per layer) or per-channel (one scalar per output channel), with the latter being more expressive at slightly higher parameter cost.
The number of extra parameters per layer is, in the per-channel case, exactly equal to the number of output channels: a few dozen to a few hundred in a typical CNN layer, negligible compared to the weight matrices. The risk of overfitting from these extra parameters is real on small datasets but disappears on large ones.
```python
import torch.nn as nn

act = nn.PReLU(num_parameters=64)  # 64 channels => 64 learnable slopes
```

PReLU was the activation used in the network that first reported surpassing human-level top-5 accuracy on ImageNet classification, in the same paper that introduced He initialization; the two ideas were co-developed.
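As a quick check of the parameter count and of the fact that the slopes really are trained, the sketch below pushes a random batch through a per-channel PReLU and backpropagates a toy loss. The batch shape and the squared-output loss are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

act = nn.PReLU(num_parameters=64)  # one learnable slope per channel
n_params = sum(p.numel() for p in act.parameters())
print(n_params, act.weight.shape)  # the slopes live in act.weight

x = torch.randn(8, 64, 16, 16)     # (batch, channels, height, width)
loss = act(x).pow(2).mean()        # toy loss, only to produce gradients
loss.backward()

# Every channel saw some negative inputs, so every slope receives a gradient.
print(act.weight.grad.shape, bool((act.weight.grad != 0).all()))
```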
ELU (Exponential Linear Unit)
Clevert et al. (2016) replaced the linear negative branch with an exponential saturation:

$$
\text{ELU}(x) = \begin{cases} x, & x \ge 0, \\ \alpha (e^{x} - 1), & x < 0. \end{cases}
$$
Two structural properties make ELU distinctive.
- Smoothness. Unlike Leaky ReLU and PReLU, which have a kink at $x = 0$, ELU can be made continuously differentiable at the origin: the right derivative is $1$ and the left derivative is $\alpha$, so for $\alpha \neq 1$ the derivative would be discontinuous, but the standard choice $\alpha = 1$ gives $f'(0^-) = f'(0^+) = 1$, removing the kink entirely. The resulting smooth landscape is friendlier to second-order optimizers.
- Negative mean. ELU outputs negative values for negative inputs, asymptotically approaching $-\alpha$. The mean of the activations is therefore closer to zero than for ReLU (which is non-negative by construction), reducing the so-called bias shift that develops across layers when the activation mean is far from zero. This serves as a kind of implicit normalization.
The exponential is more expensive than a max or a multiplication, but on modern hardware the cost is negligible. ELU outperforms ReLU on many benchmarks of moderate depth and was a popular choice in deep CNNs before batch normalization became universal.
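Both structural properties can be checked numerically. The sketch below assumes $\alpha = 1$ and standard normal inputs; the finite-difference step and the sample size are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

alpha = 1.0
zero = torch.tensor(0.0, dtype=torch.float64)
eps = torch.tensor(1e-4, dtype=torch.float64)

# One-sided finite-difference slopes at the origin: both are close to 1 when alpha = 1.
left_slope = (F.elu(zero, alpha) - F.elu(-eps, alpha)) / eps
right_slope = (F.elu(eps, alpha) - F.elu(zero, alpha)) / eps
print(left_slope.item(), right_slope.item())

# Saturation: far into the negative half the output approaches -alpha.
floor = F.elu(torch.tensor(-20.0, dtype=torch.float64), alpha).item()
print(floor)

# Output mean on standard normal inputs: ELU sits closer to zero than ReLU.
torch.manual_seed(0)
z = torch.randn(100_000)
relu_mean = F.relu(z).mean().item()
elu_mean = F.elu(z, alpha).mean().item()
print(relu_mean, elu_mean)
```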
```python
import torch.nn as nn

act = nn.ELU(alpha=1.0)
```

SELU (Scaled ELU)
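Klambauer, Unterthiner, Mayr and Hochreiter (2017) introduced SELU together with the theory of self-normalizing neural networks. SELU is ELU with two specific, carefully chosen scale factors:

$$
\text{SELU}(x) = \lambda \begin{cases} x, & x \ge 0, \\ \alpha (e^{x} - 1), & x < 0, \end{cases}
\qquad \lambda \approx 1.0507, \quad \alpha \approx 1.6733.
$$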
Where the constants $\lambda$ and $\alpha$ come from
Consider a deep FNN whose pre-activations in a given layer are approximately Gaussian with mean $\mu$ and variance $\nu$, and let $(\tilde{\mu}, \tilde{\nu})$ denote the mean and variance of the activations after applying SELU. The recursion that maps $(\mu, \nu)$ to $(\tilde{\mu}, \tilde{\nu})$ is a deterministic two-dimensional map in $(\mu, \nu)$ space, computable as a Gaussian integral against the SELU function.
The constants $\lambda$ and $\alpha$ in the SELU definition are chosen so that the point $(0, 1)$ is a fixed point of this map: if the inputs to a layer have mean $0$ and variance $1$, the outputs of that layer do too. Klambauer et al. solve the fixed-point equations, obtaining $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$. They further show that this fixed point is attracting under mild assumptions, so activations relax to $(0, 1)$ across layers from a broad range of starting means and variances.
In other words: SELU self-normalizes. Batch normalization (or layer normalization) becomes unnecessary in a pure SELU FNN of moderate width. The constants are the specific arithmetic that makes the fixed-point machinery work; any other choice of $(\lambda, \alpha)$ would move the fixed point away from $(0, 1)$ and break the property.
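The fixed-point claim can be checked numerically. The sketch below is a simplified illustration, not the derivation from the paper: it assumes zero-mean weights whose squares sum to one, so that a layer's pre-activations are Gaussian with mean $0$ and variance equal to the variance of the previous layer's activations. The helper `selu_moments` is a hypothetical name, and the integration grid and the starting variance of 1.5 are arbitrary choices.

```python
import math

import torch
import torch.nn.functional as F


def selu_moments(var_in, n=20001):
    """Mean and variance of SELU(z) for z ~ N(0, var_in), by trapezoidal quadrature."""
    std = math.sqrt(var_in)
    z = torch.linspace(-12.0 * std, 12.0 * std, n, dtype=torch.float64)
    pdf = torch.exp(-0.5 * (z / std) ** 2) / (std * math.sqrt(2.0 * math.pi))
    y = F.selu(z)
    mean = torch.trapz(y * pdf, z).item()
    second_moment = torch.trapz(y * y * pdf, z).item()
    return mean, second_moment - mean ** 2


# At the fixed point: unit-variance, zero-mean input gives (nearly) the same output moments.
mean_out, var_out = selu_moments(1.0)
print(mean_out, var_out)

# Attraction: start from a variance of 1.5 and apply the map layer after layer.
var = 1.5
for _ in range(40):
    _, var = selu_moments(var)
print(var)  # drifts back toward 1 under the stated assumptions
```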
SELU's narrow but compelling niche
SELU’s self-normalizing guarantee holds for fully connected networks. Convolutional and recurrent architectures violate the assumptions of the fixed-point argument, so SELU’s theoretical guarantee does not carry over. In practice, SELU is the strongest default for deep MLPs trained without explicit normalization layers, but it is rarely the right choice for CNNs (where BatchNorm + ReLU is universal) or Transformers (where LayerNorm + GELU is universal).
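```python
import torch.nn as nn

act = nn.SELU()

# Pair with LeCun normal initialization: fan_in mode with gain 1 gives std = 1/sqrt(fan_in).
layer = nn.Linear(256, 256)  # illustrative layer; the sizes are arbitrary
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='linear')
nn.init.zeros_(layer.bias)
```

GELU (Gaussian Error Linear Unit)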
Hendrycks and Gimpel (2016) introduced GELU as a smooth, probabilistic alternative to ReLU. The definition is

$$
\text{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right].
$$

Here $\Phi$ is the cumulative distribution function (CDF) of the standard normal: $\Phi(x)$ is the probability that a draw lands at or below $x$, so it rises smoothly from $0$ for large negative $x$ to $1$ for large positive $x$. GELU therefore scales each input by how likely a standard Gaussian is to fall below it, passing large positive values through almost unchanged and pushing large negative values toward zero. This probability reading is exactly what the Bernoulli-gate interpretation below makes precise.
GELU is often computed via an approximation, typically the tanh approximation

$$
\text{GELU}(x) \approx \frac{x}{2}\left[1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right].
$$

The two constants are fitted, not fundamental. The bracket is engineered so that the $\tanh$ term tracks $2\Phi(x) - 1 = \operatorname{erf}(x/\sqrt{2})$ (the rescaling of the Gaussian CDF into the range $(-1, 1)$ of $\tanh$): the factor $\sqrt{2/\pi}$ matches the slope at the origin, and the cubic coefficient $0.044715$ is tuned so the approximation stays accurate into the tails. The approximation was introduced when evaluating the exact error function was comparatively expensive; current PyTorch defaults to the exact erf-based form and keeps this tanh form only as an option.
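To get a feel for how close the two forms are, the sketch below implements both by hand and measures their largest gap on a grid. The function names `gelu_exact` and `gelu_tanh` and the interval $[-6, 6]$ are illustrative choices.

```python
import math

import torch
import torch.nn.functional as F


def gelu_exact(x):
    """x * Phi(x), with Phi written through the error function."""
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """The tanh approximation with constants sqrt(2/pi) and 0.044715."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + torch.tanh(inner))


x = torch.linspace(-6.0, 6.0, 2401, dtype=torch.float64)

# The hand-written exact form matches PyTorch's default GELU.
matches_default = torch.allclose(gelu_exact(x), F.gelu(x))

# Largest absolute gap between the exact form and the tanh approximation.
max_err = (gelu_exact(x) - gelu_tanh(x)).abs().max().item()
print(matches_default, f"{max_err:.2e}")
```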
The intuition is best captured by the stochastic regularization interpretation. Define a random variable $m$ with $m \sim \text{Bernoulli}(\Phi(x))$ (a Bernoulli gate whose firing probability depends on the input itself). Then

$$
\mathbb{E}[m \cdot x] = \Phi(x) \cdot x + \bigl(1 - \Phi(x)\bigr) \cdot 0 = x\,\Phi(x) = \text{GELU}(x).
$$

GELU is, in this sense, the expected output of a dropout-like mechanism whose drop probability is determined by the input. Inputs with large positive $x$ are almost always kept; inputs with large negative $x$ are almost always dropped; inputs near zero are kept with a probability that increases smoothly with $x$.
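The expectation can be checked by simulation. The sketch below draws Bernoulli gates with keep probability $\Phi(x)$ for a handful of arbitrary inputs and compares the sample average of the gated input with $x\,\Phi(x)$. The number of draws is an illustrative choice, and the agreement is only up to Monte Carlo noise.

```python
import math

import torch

torch.manual_seed(0)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], dtype=torch.float64)
keep_prob = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))  # Phi(x)

n_draws = 200_000
# One row per draw: each entry is 1 with probability Phi(x) and 0 otherwise.
gates = torch.bernoulli(keep_prob.expand(n_draws, -1))
mc_estimate = (gates * x).mean(dim=0)  # sample average of m * x

gelu = x * keep_prob                   # closed form x * Phi(x)
print(mc_estimate)
print(gelu)
```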
Two practical consequences of this construction:
- Smoothness everywhere. GELU is $C^\infty$, with no kinks. The loss landscape under GELU is smoother than under ReLU, which helps Adam-style optimizers in the kinds of complex landscapes typical of large language models.
- Non-monotonicity. GELU dips slightly below zero for moderate negative inputs before saturating near zero. This non-monotone bump is an architectural feature: it allows the activation to express “weak inhibition” in a way that strictly monotone activations cannot.
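The dip described in the second point is easy to locate numerically. The sketch below scans a grid (the range and resolution are arbitrary) for the minimum of GELU and confirms that the function falls and then rises again on the negative side.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-5.0, 1.0, 6001, dtype=torch.float64)
y = F.gelu(x)

i = torch.argmin(y)
x_min, y_min = x[i].item(), y[i].item()
print(x_min, y_min)  # the minimum is about -0.17, near x = -0.75

# Non-monotone: the output far to the left is above the minimum, and so is the output at 0.
left_value = y[0].item()           # GELU(-5), very close to zero
at_zero = F.gelu(torch.tensor(0.0)).item()
print(left_value > y_min, at_zero > y_min)
```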
GELU is the dominant activation in modern NLP
Many large Transformer-based language models since 2018 use GELU or a close relative as the hidden activation: BERT, GPT-2, and GPT-3 use GELU itself, while LLaMA uses the closely related SwiGLU variant. The smoothness, the probabilistic interpretation, and the fact that GELU avoids the dying-neuron pathology of ReLU on low-diversity token sequences combine to make it the de facto standard.
```python
import torch.nn as nn

act = nn.GELU()  # Uses the exact erf-based formula by default in PyTorch
# act = nn.GELU(approximate='tanh')  # Use the tanh approximation explicitly
```

A word on Swish / SiLU
Closely related to GELU, the Swish activation (Ramachandran et al., 2017), also called SiLU, is defined as

$$
\text{Swish}(x) = x\,\sigma(x) = \frac{x}{1 + e^{-x}},
$$

where $\sigma$ is the sigmoid. The shape is very similar to GELU: smooth, non-monotonic, identity in the positive limit, near-zero in the negative limit. The choice between GELU and Swish is largely empirical; both consistently outperform ReLU on Transformer-scale models. Recent LLM architectures (LLaMA, PaLM) use the SwiGLU gating, which combines Swish with a learned gate similar in spirit to the gating in GRUs.
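A minimal sketch of a SwiGLU feed-forward block follows, assuming the common formulation in which a Swish-activated projection multiplies a second linear projection elementwise before a final output projection. The class name `SwiGLUFeedForward`, the layer names, the absence of biases, and the sizes are illustrative assumptions, not the exact layout of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFeedForward(nn.Module):
    """Computes W_down( SiLU(W_gate x) * (W_up x) )."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch, passed through SiLU
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # linear value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # projection back to d_model

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


torch.manual_seed(0)
ffn = SwiGLUFeedForward(d_model=16, d_hidden=32)
tokens = torch.randn(2, 5, 16)  # (batch, sequence, d_model)
out = ffn(tokens)
print(out.shape)

# SiLU is exactly x * sigmoid(x).
t = torch.linspace(-4.0, 4.0, 9)
print(torch.allclose(F.silu(t), t * torch.sigmoid(t)))
```

The gate branch decides, per hidden unit and per token, how much of the value branch passes through, which is the sense in which the mechanism resembles GRU gating.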
Selection guide
The table below captures the operational defaults that emerge from current practice.
| Situation | Recommended activation | Reason |
|---|---|---|
| Standard CNN, BatchNorm in use | ReLU | Simplicity, computational economy, BatchNorm absorbs the bias-shift problem. |
| CNN, dying-ReLU suspected | Leaky ReLU ($\alpha = 0.01$) | Trivial upgrade, no extra parameters. |
| GAN generator/discriminator | Leaky ReLU | Empirically more stable; avoids dying neurons in adversarial training. |
| Very deep FNN, no normalization | SELU + LeCun normal init | Self-normalization removes the need for explicit normalization layers. |
| Transformer encoder/decoder | GELU (or Swish/SiLU) | Smoothness helps Adam-class optimizers; standard in modern LLMs. |
| Reinforcement learning | ELU or GELU | Smoother gradient surface; avoids dead neurons during sparse-reward training. |
Practical rule of thumb
- For a first baseline on a new task, start with ReLU. It is the simplest, the fastest, and rarely wrong by much.
- If training is unstable or many neurons die, switch to Leaky ReLU or GELU. The change is one line of PyTorch code and frequently fixes the symptom.
- For Transformer or transformer-like architectures, use GELU by default; this is what the published reference implementations do.
- For deep MLPs without normalization, use SELU with the correct initialization and standardized inputs; the self-normalizing guarantee is real and useful.
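One way to keep the activation a one-line change is to build models through a small factory. The sketch below is a hypothetical helper (`ACTIVATIONS` and `make_mlp` are illustrative names, and the layer sizes are arbitrary) that swaps the hidden activation by name.

```python
import torch
import torch.nn as nn

# Name -> constructor for the activations discussed in this note.
ACTIVATIONS = {
    "relu": nn.ReLU,
    "leaky_relu": lambda: nn.LeakyReLU(negative_slope=0.01),
    "elu": nn.ELU,
    "selu": nn.SELU,
    "gelu": nn.GELU,
    "silu": nn.SiLU,
}


def make_mlp(sizes, activation="relu"):
    """Fully connected network with the chosen activation after every hidden layer."""
    layers = []
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        layers.append(nn.Linear(n_in, n_out))
        if i < len(sizes) - 2:  # no activation after the output layer
            layers.append(ACTIVATIONS[activation]())
    return nn.Sequential(*layers)


baseline = make_mlp([8, 16, 16, 4], activation="relu")
variant = make_mlp([8, 16, 16, 4], activation="gelu")  # the one-line change
print(variant(torch.randn(2, 8)).shape)
```

For the SELU row of the guide, the weights would additionally need LeCun normal initialization, which this helper does not apply.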
The choice of activation function fixes one cause of vanishing gradients, but a second cause remains: even with a perfect activation, the pre-activations can land in unfortunate regions purely because of how the weights are initialized. The next note analyzes the default Gaussian initialization and shows exactly why it fails in deep networks.
Sources
The variants discussed here (PReLU, GELU, Swish), with the broader initialization and normalization literature, are collected in Initialization, Activations, and Normalization.