The *LU family

The diagnosis in ReLU identified the dying-ReLU problem as the price paid for the sharp half-line $z < 0$ on which the derivative is exactly zero. Once a neuron’s pre-activation falls into this region, no gradient flows, no update occurs, and the neuron is permanently silent.

The natural family of fixes preserves the linear, non-saturating positive half of ReLU and softens the negative half so that the gradient is no longer exactly zero. The resulting variants are collectively called the *LU family (Leaky ReLU, PReLU, ELU, SELU, GELU). Each one trades off computational simplicity, smoothness, output statistics, and inductive bias differently, and each has a domain in which it dominates.

The shared structural pattern

Every member of the family has the form
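$f(z) = \begin{cases} z & \text{if } z \geq 0, \\ g(z) & \text{if } z < 0, \end{cases}$
for some continuous function $g$ with $g(0) = 0$ that keeps $f'(z)$ nonzero (or at least nonzero in expectation) for $z < 0$. For the piecewise members (Leaky ReLU, PReLU, ELU, and SELU up to its overall scale $λ$) the right half is exactly the identity and only the left half differs. GELU is the exception: it is a single smooth formula that approaches the identity for large positive $z$ rather than matching it exactly.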
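The shared pattern can be sketched in a few lines of plain Python. The helper name `make_lu` is hypothetical and introduced only for illustration; the two negative branches are the Leaky ReLU and ELU branches from this note, with $α = 0.01$ and $α = 1$ assumed.

```python
import math

def make_lu(g):
    """Build f(z) = z for z >= 0 and g(z) for z < 0 from a negative-branch function g."""
    def f(z):
        return z if z >= 0 else g(z)
    return f

# Two negative branches; both satisfy g(0) = 0, so f is continuous at the origin.
leaky_relu = make_lu(lambda z: 0.01 * z)      # assumed alpha = 0.01
elu = make_lu(lambda z: math.expm1(z))        # assumed alpha = 1; expm1(z) = e^z - 1

for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, leaky_relu(z), elu(z))
```

Only the function passed to `make_lu` changes between variants, which is the sense in which the left half is what differs.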

Comparison table

Activation	Definition of $f(z)$	$f'(z)$ for $z < 0$	Key property
Leaky ReLU	$f(z) = z$ if $z \geq 0$, $αz$ if $z < 0$; fixed $α \ll 1$ (e.g. $0.01$)	constant $α$	Cheap, prevents dead neurons, no exponentials.
PReLU	same form as Leaky ReLU, but $α$ is learned	learnable parameter	Adaptable slope per channel; small parameter cost.
ELU	$f(z) = z$ if $z \geq 0$, $α(e^{z} - 1)$ if $z < 0$	$α e^{z}$	Smooth, saturates at $-α$; output mean closer to zero.
SELU	$λ \cdot \mathrm{ELU}(z)$ with specific $(λ, α)$	scaled exponential	Self-normalizing: preserves unit variance across layers.
GELU	$f(z) = z Φ(z)$, $Φ$ the Gaussian CDF	smooth, non-monotonic	Probabilistic gating, default in modern Transformers.

Leaky ReLU

The minimal modification of ReLU: replace the flat zero on $z < 0$ with a line of small positive slope $α$:

LeakyReLU_{α}(z) = \begin{cases} z & \text{if } z \geq 0, \\ α z & \text{if } z < 0, \end{cases} \qquad LeakyReLU_{α}'(z) = \begin{cases} 1 & \text{if } z > 0, \\ α & \text{if } z < 0. \end{cases}

The typical choice is $α = 0.01$, occasionally as large as $0.1$. The derivative is never exactly zero, so the dying-ReLU pathology is eliminated by construction: a neuron in the negative region still has a tiny but nonzero gradient $α$, and gradient descent can in principle pull it back into the active region.
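To see the recovery argument in numbers, here is a minimal sketch with a single toy neuron whose only trainable parameter is its bias, so the pre-activation equals the bias. The squared-error loss, the target of 1, the learning rate, and all function names are illustrative assumptions rather than anything prescribed above.

```python
def relu(z):
    return max(z, 0.0)

def relu_grad(z):
    """Derivative of ReLU (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    return z if z >= 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU (taken as 1 at z = 0)."""
    return 1.0 if z >= 0 else alpha

def train_bias(act, act_grad, b=-1.0, target=1.0, lr=0.5, steps=1000):
    """Gradient descent on L = 0.5 * (act(b) - target)^2 with respect to b."""
    for _ in range(steps):
        y = act(b)
        b -= lr * (y - target) * act_grad(b)  # chain rule: dL/db = (y - target) * f'(b)
    return b

b_relu = train_bias(relu, relu_grad)
b_leaky = train_bias(leaky_relu, leaky_relu_grad)
print("ReLU bias after training:      ", b_relu)
print("Leaky ReLU bias after training:", b_leaky)
```

Under these assumptions the ReLU neuron never moves, because every update is multiplied by a zero derivative, while the Leaky ReLU neuron creeps back across zero in steps of size proportional to $α$ and then converges to the target at the normal rate.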

The cost is essentially nothing: no exponentials, one multiplication and a sign check per forward evaluation. Leaky ReLU is therefore the safe default upgrade over ReLU when dead neurons are suspected, and is widely used in computer vision (especially in GANs, where dying ReLU is a recurrent training pathology in the generator).

import torch.nn as nn
act = nn.LeakyReLU(negative_slope=0.01)

PReLU (Parametric ReLU)

Introduced by He et al. (2015), PReLU generalizes Leaky ReLU by treating the negative slope not as a hyperparameter but as a learnable parameter $a$:

PReLU_{a}(z) = \begin{cases} z & \text{if } z \geq 0, \\ a z & \text{if } z < 0. \end{cases}

The parameter $a$ is initialized to a small positive value (typically $0.25$) and updated by backpropagation alongside the weights. The negative slope can be shared across all channels (one scalar per layer) or per-channel (one scalar per output channel), with the latter being more expressive at slightly higher parameter cost.
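What "updated by backpropagation" means for the slope can be sketched directly: the derivative of the output with respect to $a$ is $z$ on the negative branch and $0$ on the positive branch. The upstream gradient of 1, the learning rate, and the function names below are assumptions made for illustration.

```python
def prelu(z, a):
    return z if z >= 0 else a * z

def prelu_grad_a(z):
    """d PReLU_a(z) / da: the slope only affects the negative branch."""
    return z if z < 0 else 0.0

a, lr = 0.25, 0.1
z = -2.0
upstream = 1.0                       # assumed dL/df arriving from later layers
grad_a = upstream * prelu_grad_a(z)  # chain rule
a_new = a - lr * grad_a              # one plain gradient-descent step on the slope

# Central finite difference as a sanity check on the analytic gradient.
eps = 1e-6
numeric = (prelu(z, a + eps) - prelu(z, a - eps)) / (2 * eps)
print(grad_a, numeric, a_new)
```

A unit with only positive pre-activations contributes nothing to the gradient of $a$, so the slope is shaped entirely by the inputs that land on the negative side.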

The number of extra parameters per layer is, in the per-channel case, exactly equal to the number of output channels: a few dozen to a few hundred in a typical CNN layer, negligible compared to the weight matrices. The risk of overfitting from these extra parameters is real on small datasets but disappears on large ones.

import torch.nn as nn
act = nn.PReLU(num_parameters=64)   # 64 channels => 64 learnable slopes

PReLU was introduced in the same paper as He initialization (He et al., 2015), which reported surpassing human-level top-5 accuracy on ImageNet classification; the two ideas were co-developed.

ELU (Exponential Linear Unit)

Clevert et al. (2016) replaced the linear negative branch with an exponential saturation:

ELU_{α}(z) = \begin{cases} z & \text{if } z \geq 0, \\ α (e^{z} - 1) & \text{if } z < 0, \end{cases} \qquad ELU_{α}'(z) = \begin{cases} 1 & \text{if } z > 0, \\ α e^{z} & \text{if } z < 0. \end{cases}

Two structural properties make ELU distinctive.

Smoothness. Unlike Leaky ReLU and PReLU, which have a kink at $z = 0$, ELU can be made continuously differentiable at the origin. In general the one-sided derivatives are $f'(0^{-}) = α$ and $f'(0^{+}) = 1$, which disagree unless $α = 1$; the standard choice $α = 1$ makes them equal and removes the kink entirely. The resulting smooth landscape is friendlier to second-order optimizers.
Mean closer to zero. ELU outputs negative values for negative inputs, asymptotically approaching $-α$. The mean of the activations is therefore closer to zero than for ReLU (which is non-negative by construction), reducing the so-called bias shift that develops across layers when the activation mean is far from zero. This serves as a kind of implicit normalization.
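Both properties can be checked numerically. The sketch below assumes standard normal pre-activations (an assumption made here only to have a concrete input distribution) and uses a hand-written midpoint rule; the helper names are hypothetical.

```python
import math

def elu(z, alpha=1.0):
    return z if z >= 0 else alpha * math.expm1(z)  # expm1(z) = e^z - 1

def relu(z):
    return max(z, 0.0)

def left_slope_at_zero(alpha, h=1e-6):
    """One-sided finite difference approximating f'(0^-)."""
    return (elu(0.0, alpha) - elu(-h, alpha)) / h

def gaussian_mean(f, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule estimate of E[f(Z)] for Z ~ N(0, 1)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * h
        total += f(z) * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return total * h

print("left slope at 0, alpha=1.0:", left_slope_at_zero(1.0))  # right slope is 1
print("left slope at 0, alpha=0.5:", left_slope_at_zero(0.5))  # mismatch => kink
mean_relu = gaussian_mean(relu)
mean_elu = gaussian_mean(elu)
print("mean of ReLU(Z):", mean_relu)
print("mean of ELU(Z): ", mean_elu)
```

With $α = 1$ the left slope matches the right slope of 1, and the ELU output mean sits noticeably closer to zero than the ReLU mean, though it is not exactly zero.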

The exponential is more expensive than a max or a multiplication, but on modern hardware the cost is negligible. ELU outperforms ReLU on many benchmarks of moderate depth and was a popular choice in deep CNNs before batch normalization became universal.

import torch.nn as nn
act = nn.ELU(alpha=1.0)

SELU (Scaled ELU)

Klambauer, Unterthiner, Mayr and Hochreiter (2017) introduced SELU together with the theory of self-normalizing neural networks. SELU is ELU with two specific, carefully chosen scale factors:

SELU(z) = λ \begin{cases} z & \text{if } z \geq 0, \\ α (e^{z} - 1) & \text{if } z < 0, \end{cases} \qquad λ \approx 1.0507, \quad α \approx 1.6732.

Where the constants $λ$ and $α$ come from

Consider a deep FNN with weights initialized so that the pre-activations $z^{ℓ}$ have mean $0$ and variance $1$ at every layer, and let $μ^{ℓ}, ν^{ℓ}$ denote the mean and variance of the activations after applying SELU. The recursion that maps $(μ^{ℓ - 1}, ν^{ℓ - 1})$ to $(μ^{ℓ}, ν^{ℓ})$ is a deterministic two-dimensional map in $(μ, ν)$ space, computable as a Gaussian integral against the SELU function.

The constants $λ$ and $α$ in the SELU definition are chosen so that the point $(μ, ν) = (0, 1)$ is a fixed point of this map: if the inputs to a layer have mean $0$ and variance $1$ , the outputs of that layer do too. Klambauer et al. solve the fixed-point equations numerically, obtaining the values $λ \approx 1.0507$ and $α \approx 1.6732$ stated above. They further show that this fixed point is attracting under mild assumptions, so activations relax to $(0, 1)$ across layers regardless of the initial distribution.
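The fixed-point property can be checked for the simplest case, a single unit whose input is exactly standard normal. The sketch below is only that check, not the full two-dimensional map from the paper; the longer decimal expansions of $λ$ and $α$ are the commonly published ones (they round to the values above), and the midpoint-rule helper is a hypothetical name.

```python
import math

LAMBDA = 1.0507009873554805   # rounds to the 1.0507 quoted in the text
ALPHA = 1.6732632423543772    # rounds to the 1.6732 quoted in the text

def selu(z):
    return LAMBDA * (z if z >= 0 else ALPHA * math.expm1(z))

def gaussian_expectation(f, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule estimate of E[f(Z)] for Z ~ N(0, 1)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * h
        total += f(z) * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return total * h

mean = gaussian_expectation(selu)
second_moment = gaussian_expectation(lambda z: selu(z) ** 2)
variance = second_moment - mean ** 2
print("mean of SELU(Z):    ", mean)
print("variance of SELU(Z):", variance)
```

The mean comes out at zero and the variance at one up to integration error, which is the statement that $(0, 1)$ is mapped to itself. Changing either constant slightly and rerunning shows the outputs drifting away from $(0, 1)$.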

In other words: SELU self-normalizes. Batch normalization (or layer normalization) becomes unnecessary in a pure SELU FNN of moderate width. The constants are the specific arithmetic that makes the fixed-point machinery work; any other choice of $(λ, α)$ would break the property.

SELU's narrow but compelling niche

SELU’s self-normalizing guarantee holds for fully connected networks. Convolutional and recurrent architectures violate the assumptions of the fixed-point argument, so SELU’s theoretical guarantee does not carry over. In practice, SELU is a strong default for deep MLPs trained without explicit normalization layers, but it is rarely the right choice for CNNs (where BatchNorm + ReLU is the usual pairing) or Transformers (where LayerNorm + GELU is the usual pairing).

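import torch.nn as nn

layer = nn.Linear(256, 256)  # example layer; the sizes are placeholders
act = nn.SELU()
# Pair with LeCun normal initialization (std = 1/sqrt(fan_in)), expressed here
# through kaiming_normal_ with the gain of a linear activation (gain = 1):
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='linear')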

GELU (Gaussian Error Linear Unit)

Hendrycks and Gimpel (2016) introduced GELU as a smooth, probabilistic alternative to ReLU. The definition is

GELU (z) = z \cdot Φ (z), Φ (z) = P (X \leq z), X \sim N (0, 1) .

Here $Φ$ is the cumulative distribution function (CDF) of the standard normal: $Φ (z)$ is the probability that a draw $X \sim N (0, 1)$ lands at or below $z$ , so it rises smoothly from $0$ for large negative $z$ to $1$ for large positive $z$ . GELU therefore scales each input by how likely a standard Gaussian is to fall below it, passing large positive values through almost unchanged and pushing large negative values toward zero. This probability reading is exactly what the Bernoulli-gate interpretation below makes precise.

In practice GELU is often computed via an approximation, typically the tanh approximation

GELU(z) \approx 0.5 z (1 + \tanh[\sqrt{2/π} (z + 0.044715 z^{3})]) .

The two constants are chosen to match $Φ$, not fundamental. The bracket is engineered so that $\tanh[\cdot]$ tracks $2Φ(z) - 1$ (the rescaling of the Gaussian CDF into the range of $\tanh$): the factor $\sqrt{2/π}$ matches the slope at the origin, since the derivative of $2Φ(z) - 1$ at $z = 0$ is $2/\sqrt{2π} = \sqrt{2/π}$, and the cubic coefficient $0.044715$ is tuned so the approximation stays accurate into the tails. The approximation was introduced when evaluating the exact error function was comparatively expensive; current PyTorch defaults to the exact $Φ$ and keeps this tanh form only as an option.
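How close the tanh form is to the exact definition can be measured directly. The sketch below uses only the standard library, writes $Φ$ through `math.erf`, and uses the standard inner coefficient $\sqrt{2/π}$; the grid on $[-6, 6]$ is an arbitrary choice made for illustration.

```python
import math

def gelu_exact(z):
    """z * Phi(z), with Phi written through the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh approximation with inner coefficient sqrt(2/pi)."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * z * (1.0 + math.tanh(c * (z + 0.044715 * z ** 3)))

grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print("max |exact - tanh| on [-6, 6]:", max_gap)
```

On this grid the gap stays below $10^{-3}$, which is why the two forms are usually treated as interchangeable for training, although they are not bit-identical and a checkpoint should be run with the variant it was trained with.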

The intuition is best captured by the stochastic regularization interpretation. Define a random variable $M \in \{0, 1\}$ with $P(M = 1) = Φ(z)$ (a Bernoulli gate whose firing probability depends on the input itself). Then

E [M \cdot z] = z \cdot Φ (z) = GELU (z) .

GELU is, in this sense, the expected output of a dropout-like mechanism whose drop probability is determined by the input. Inputs with large positive $z$ are almost always kept; inputs with large negative $z$ are almost always dropped; inputs near zero are kept with a probability that increases smoothly with $z$ .
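The expectation identity can be checked by simulation. This is a Monte Carlo sketch with an assumed sample size and seed; `gated_mean` is a hypothetical helper that draws the Bernoulli gate repeatedly for a fixed input and averages the gated output.

```python
import math
import random

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gated_mean(z, n_samples=100_000, seed=0):
    """Average of M * z over draws of M ~ Bernoulli(Phi(z))."""
    rng = random.Random(seed)
    p_keep = phi_cdf(z)
    kept = sum(1 for _ in range(n_samples) if rng.random() < p_keep)
    return z * kept / n_samples

for z in (-1.0, 0.5, 2.0):
    print(z, gated_mean(z), z * phi_cdf(z))
```

The simulated average and the closed form $z Φ(z)$ agree up to sampling noise, which shrinks as the number of samples grows.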

Two practical consequences of this construction:

Smoothness everywhere. GELU is $C^{\infty}$ , with no kinks. The loss landscape under GELU is smoother than under ReLU, which helps Adam-style optimizers in the kinds of complex landscapes typical of large language models.
Non-monotonicity. GELU dips slightly below zero for moderate negative inputs before saturating near zero. This non-monotone bump is an architectural feature: it allows the activation to express “weak inhibition” in a way that strictly monotone activations cannot.
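The dip can be located with a simple grid search. The sketch below assumes the exact erf-based definition and a grid spacing of 0.001 on $[-5, 0]$; both are illustrative choices.

```python
import math

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

grid = [i / 1000.0 for i in range(-5000, 1)]  # z from -5 to 0
z_min = min(grid, key=gelu)
print("minimum near z =", z_min, "with value", gelu(z_min))

# Non-monotone: the function falls as z increases toward the minimum, then rises.
print(gelu(-3.0) > gelu(z_min) < gelu(-0.1))
```

The minimum lies at roughly $z \approx -0.75$ with a value of roughly $-0.17$, so the dip is shallow but clearly present.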

GELU is the dominant activation in modern NLP

GELU has been the standard hidden activation in Transformer language models since 2018: BERT, GPT-2, and GPT-3 all use it. Later models often move to gated variants: T5 v1.1 uses GEGLU, a GELU-gated linear unit (the original T5 used ReLU), and LLaMA uses the closely related SwiGLU, which replaces GELU with Swish inside the gate. The smoothness, the probabilistic interpretation, and the fact that GELU avoids the dying-neuron pathology of ReLU on low-diversity token sequences combine to make GELU and its gated relatives the de facto standard.

import torch.nn as nn
act = nn.GELU()  # Uses the exact erf-based formula by default in PyTorch
# act = nn.GELU(approximate='tanh')  # Use the tanh approximation explicitly

A word on Swish / SiLU

Closely related to GELU, the Swish activation (Ramachandran et al., 2017), also called SiLU, is defined as

Swish (z) = z \cdot σ (z),

where $σ$ is the sigmoid. The shape is very similar to GELU: smooth, non-monotonic, approaching the identity for large positive $z$ and zero for large negative $z$. The choice between GELU and Swish is largely empirical; both tend to outperform ReLU on Transformer-scale models. Recent LLM architectures (LLaMA, PaLM) use the SwiGLU gating, which combines Swish with a learned gate similar in spirit to the gating in GRUs.
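The similarity in shape can be quantified with a short comparison. The sketch below assumes the exact erf-based GELU and an arbitrary grid on $[-6, 6]$; it reports the largest pointwise gap and the location of the Swish dip.

```python
import math

def swish(z):
    """z * sigmoid(z), also called SiLU."""
    return z / (1.0 + math.exp(-z))

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(swish(z) - gelu(z)) for z in grid)
z_min = min(grid, key=swish)
print("max |Swish - GELU| on [-6, 6]:", max_gap)
print("Swish minimum near z =", z_min, "with value", swish(z_min))
```

Swish has a deeper and wider dip than GELU (a minimum of about $-0.28$ near $z \approx -1.28$), and the two curves differ by at most about $0.2$ on this grid, so they are close in shape but not interchangeable numerically.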

Selection guide

The table below captures the operational defaults that emerge from current practice.

Situation	Recommended activation	Reason
Standard CNN, BatchNorm in use	ReLU	Simplicity, computational economy, BatchNorm absorbs the bias-shift problem.
CNN, dying-ReLU suspected	Leaky ReLU ( $α = 0.01$ )	Trivial upgrade, no extra parameters.
GAN generator/discriminator	Leaky ReLU	Empirically more stable; avoids dying neurons in adversarial training.
Very deep FNN, no normalization	SELU + LeCun normal init	Self-normalization removes the need for explicit normalization layers.
Transformer encoder/decoder	GELU (or Swish/SiLU)	Smoothness helps Adam-class optimizers; standard in modern LLMs.
Reinforcement learning	ELU or GELU	Smoother gradient surface; avoids dead neurons during sparse-reward training.

Practical rule of thumb

For a first baseline on a new task, start with ReLU. It is the simplest, the fastest, and rarely wrong by much.

If training is unstable or many neurons die, switch to Leaky ReLU or GELU. The change is one line of PyTorch code and frequently fixes the symptom.

For Transformer or Transformer-like architectures, use GELU by default; this is what many published reference implementations do.

For deep MLPs without normalization, use SELU with the correct initialization and standardized inputs; the self-normalizing guarantee is real and useful.

The choice of activation function fixes one cause of vanishing gradients, but a second cause remains: even with a perfect activation, the pre-activations $z^{ℓ}$ can land in unfortunate regions purely because of how the weights are initialized. The next note analyzes the default Gaussian initialization and shows exactly why it fails in deep networks.

Sources

The variants discussed here (PReLU, GELU, Swish), with the broader initialization and normalization literature, are collected in Initialization, Activations, and Normalization.

Deep Learning: Zero to Hero
