Dynamic Anisotropic Neighborhood Aggregation in Graph Attention Networks

Limitation of isotropic aggregation

In conventional message-passing neural networks, the aggregation operator is typically isotropic, for example a sum or an average over neighboring messages. Although such operators satisfy permutation invariance, they implicitly assume that neighboring nodes contribute uniformly, or that their influence is determined only by fixed structural quantities such as node degree.

Attentional remedy

Graph Attention Networks (GATs) replace uniform aggregation with a learned, feature-dependent anisotropic weighting scheme. The contribution of each neighboring message is determined dynamically from the semantic compatibility between the interacting node representations.

Graph attention therefore introduces a data-dependent inductive bias into neighborhood aggregation: not all neighbors are equally informative, and their relative importance should be inferred from the node features themselves.

Reading the figure

The target node is updated by comparing its representation with those of its neighbors, assigning a normalized relevance weight to each neighbor, and aggregating the corresponding feature vectors into a new node representation. The color coding links each neighbor to its associated interaction score and value vector, thereby making the flow from local interaction to final update visually explicit.

Terminological note

Labels such as Key and Query are often used heuristically in graph-attention illustrations and should not be over-interpreted. The essential operation is the computation of a learned compatibility score between the target node and each neighboring node. The mathematical content lies in the attention coefficients, not in the specific naming convention adopted by the diagram.


1. General Attentional Aggregation

Let

$$\tilde{\mathcal{N}}(i) \;=\; \mathcal{N}(i) \cup \{i\}$$

denote the neighborhood of node $i$ augmented with a self-loop. The attentional aggregation performed at layer $l$ is then written as

$$h_i^{(l+1)} \;=\; \sigma\!\Big( \sum_{j \in \tilde{\mathcal{N}}(i)} \alpha_{ij}\, W^{(l)} h_j^{(l)} \Big),$$

where $h_j^{(l)}$ is the representation of node $j$ at layer $l$, and $\alpha_{ij}$ is the attention coefficient assigned to node $j$ when updating node $i$.

To ensure numerical stability and comparability across neighborhoods of different sizes, the coefficients are constrained to form a probability distribution:

$$\alpha_{ij} \;\ge\; 0, \qquad \sum_{j \in \tilde{\mathcal{N}}(i)} \alpha_{ij} \;=\; 1.$$

This normalization is obtained through a softmax applied to learned pairwise interaction scores.
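As a minimal NumPy sketch, with made-up raw scores, the softmax produces exactly such a distribution over one neighborhood (the max-subtraction trick is a standard stabilization, not part of the definition):

```python
import numpy as np

def neighborhood_softmax(scores):
    """Normalize raw interaction scores e_ij over one neighborhood.

    Subtracting the row maximum before exponentiating avoids overflow
    and leaves the result unchanged.
    """
    shifted = scores - scores.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

# Illustrative raw scores e_ij for a node against its three neighbors plus itself.
e = np.array([2.0, -1.0, 0.5, 2.0])
alpha = neighborhood_softmax(e)
# alpha is non-negative, sums to 1, and assigns equal weight to equal scores
```

Note that two neighbors with identical raw scores (here the first and last) receive identical weights, regardless of neighborhood size.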


2. The Canonical GAT Layer

The original GAT layer computes the updated representation of node through four conceptually distinct steps.

2.1 Shared Linear Projection

All node features are first mapped into a learned latent space through a shared linear transformation:

$$z_i \;=\; W h_i,$$

where $W \in \mathbb{R}^{d' \times d}$ is a trainable matrix shared across all nodes.

2.2 Pairwise Interaction Scores

A shared attentional mechanism computes an unnormalized score $e_{ij}$ for each ordered pair $(i, j)$ with $j \in \tilde{\mathcal{N}}(i)$. In the canonical GAT architecture, this score is produced by a single-layer feedforward mechanism followed by a LeakyReLU nonlinearity:

$$e_{ij} \;=\; \mathrm{LeakyReLU}\!\big( a^\top [\, W h_i \,\|\, W h_j \,] \big),$$

where $a \in \mathbb{R}^{2d'}$ is a shared trainable vector and $\|$ denotes concatenation.

2.3 Softmax Normalization

The raw scores are normalized over the augmented neighborhood of node $i$:

$$\alpha_{ij} \;=\; \frac{\exp(e_{ij})}{\sum_{k \in \tilde{\mathcal{N}}(i)} \exp(e_{ik})}.$$

2.4 Weighted Aggregation

The updated representation is obtained by aggregating the projected neighbor features with attention-dependent weights:

$$h_i' \;=\; \sigma\!\Big( \sum_{j \in \tilde{\mathcal{N}}(i)} \alpha_{ij}\, W h_j \Big),$$

where $\sigma$ denotes a pointwise nonlinearity.
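The four steps can be sketched end to end in NumPy. This is a single-head sketch with illustrative shapes, a toy adjacency, and random weights; the function name `gat_layer` and the choice of ReLU for the output nonlinearity are my own, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, A, W, a):
    """One single-head GAT layer.

    H : (N, d)   node features
    A : (N, N)   adjacency with self-loops (A[i, j] > 0 iff j in N(i) ∪ {i})
    W : (d, d')  shared projection
    a : (2*d',)  shared attention vector
    """
    Z = H @ W                                     # 2.1 shared projection z_i = W h_i
    dp = Z.shape[1]
    a_src, a_dst = a[:dp], a[dp:]                 # split a over [z_i || z_j]
    E = leaky_relu((Z @ a_src)[:, None] + (Z @ a_dst)[None, :])  # 2.2 scores e_ij
    E = np.where(A > 0, E, -np.inf)               # only the neighborhood contributes
    E = E - E.max(axis=1, keepdims=True)          # stable softmax
    alpha = np.exp(E)
    alpha /= alpha.sum(axis=1, keepdims=True)     # 2.3 normalization per node
    return np.maximum(alpha @ Z, 0.0), alpha      # 2.4 aggregation + σ (ReLU here)

N, d, dp = 4, 3, 2
A = np.array([[1, 1, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
H = rng.normal(size=(N, d))
W = rng.normal(size=(d, dp))
a = rng.normal(size=2 * dp)
H_new, alpha = gat_layer(H, A, W, a)
# each row of alpha is a probability distribution over that node's neighborhood
```

Masking non-neighbors with $-\infty$ before the softmax drives their weights to exactly zero, which is how locality with respect to the adjacency is enforced.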


3. Asymmetry and Permutation Equivariance

A frequent misconception is that attention on an undirected graph must itself be symmetric. This is not required in the canonical GAT formulation. Because the score is computed from the ordered concatenation

$$[\, W h_i \,\|\, W h_j \,],$$

one generally has

$$e_{ij} \neq e_{ji} \quad \text{and hence} \quad \alpha_{ij} \neq \alpha_{ji}.$$
This asymmetry is not a flaw; rather, it allows the model to learn directed information flow even when the underlying graph is structurally undirected.

Why equivariance is preserved

Permutation equivariance does not rely on symmetric edge scores. It is guaranteed by three architectural properties: global parameter sharing, locality with respect to the graph adjacency, and the permutation-invariant summation over neighbors.


4. Alternative Attention Mechanisms

The canonical concatenation-based mechanism is only one possible realization of attentional message passing. More general score functions can be employed.

4.1 Bilinear Interaction

A computationally efficient alternative uses a learned bilinear form:

$$e_{ij} \;=\; (W h_i)^\top B \,(W h_j),$$

where $B \in \mathbb{R}^{d' \times d'}$ is a trainable interaction matrix.

4.2 Non-Linear MLP Scoring

A more flexible alternative uses a multilayer perceptron applied to the concatenated projections:

$$e_{ij} \;=\; \mathrm{MLP}\big( [\, W h_i \,\|\, W h_j \,] \big).$$
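A one-hidden-layer instance of such a scorer can be sketched as follows; the hidden width, the tanh nonlinearity, and all weights are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(3)

def mlp_score(z_i, z_j, W1, w2):
    """e_ij = w2^T tanh(W1 [z_i || z_j]): a one-hidden-layer pairwise scorer."""
    x = np.concatenate([z_i, z_j])
    return float(w2 @ np.tanh(W1 @ x))

dp, hidden = 2, 4
W1 = rng.normal(size=(hidden, 2 * dp))   # first layer over the concatenation
w2 = rng.normal(size=hidden)             # readout to a scalar score
z_i, z_j = rng.normal(size=dp), rng.normal(size=dp)
e_ij = mlp_score(z_i, z_j, W1, w2)
```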

Symmetry is optional, not automatic

If a specific application requires symmetric edge scoring, the scoring architecture must be explicitly constructed to satisfy that constraint. It does not follow from graph undirectedness alone.
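One illustrative way to build such a constraint into the architecture (my own construction, not part of the canonical GAT) is to score both orders and average, which is symmetric by construction:

```python
import numpy as np

rng = np.random.default_rng(4)

def symmetrize(score_fn):
    """Wrap any pairwise scorer so the result is order-independent."""
    return lambda u, v: 0.5 * (score_fn(u, v) + score_fn(v, u))

# An (asymmetric) concatenation-style scorer with illustrative weights.
a = rng.normal(size=6)
raw_score = lambda u, v: float(a @ np.concatenate([u, v]))
sym_score = symmetrize(raw_score)

h_u, h_v = rng.normal(size=3), rng.normal(size=3)
# raw_score depends on argument order; sym_score does not
```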


5. Multi-Head Graph Attention

Single-head attention may exhibit high variance during optimization and may fail to capture multiple distinct relational patterns simultaneously. This motivates the use of multi-head graph attention.

With $K$ independent heads, the hidden-layer update is typically defined by concatenation:

$$h_i' \;=\; \big\Vert_{k=1}^{K}\, \sigma\!\Big( \sum_{j \in \tilde{\mathcal{N}}(i)} \alpha_{ij}^{(k)}\, W^{(k)} h_j \Big).$$

At the final prediction layer, averaging is often preferred over concatenation:

$$h_i' \;=\; \sigma\!\Big( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \tilde{\mathcal{N}}(i)} \alpha_{ij}^{(k)}\, W^{(k)} h_j \Big).$$

Functional role of multi-head attention

Multi-head attention stabilizes training and allows different heads to specialize in distinct relational patterns, thereby increasing representational flexibility.
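The two combination rules can be sketched side by side. This NumPy sketch uses a toy graph, random illustrative weights, and omits the per-head output nonlinearity for brevity:

```python
import numpy as np

rng = np.random.default_rng(5)

def attention_head(H, A, W, a):
    """One GAT head: projection, LeakyReLU scoring, softmax, aggregation."""
    Z = H @ W
    dp = Z.shape[1]
    E = (Z @ a[:dp])[:, None] + (Z @ a[dp:])[None, :]
    E = np.where(E > 0, E, 0.2 * E)               # LeakyReLU
    E = np.where(A > 0, E, -np.inf)               # restrict to the neighborhood
    E = E - E.max(axis=1, keepdims=True)
    alpha = np.exp(E)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ Z

N, d, dp, K = 4, 3, 2, 3
A = np.eye(N) + np.array([[0, 1, 1, 0],
                          [1, 0, 0, 1],
                          [1, 0, 0, 0],
                          [0, 1, 0, 0]], dtype=float)
H = rng.normal(size=(N, d))
heads = [attention_head(H, A, rng.normal(size=(d, dp)), rng.normal(size=2 * dp))
         for _ in range(K)]

H_hidden = np.concatenate(heads, axis=1)   # hidden layers: concatenate -> (N, K*dp)
H_final = np.mean(heads, axis=0)           # final layer: average -> (N, dp)
```

Concatenation preserves each head's subspace, widening the representation by a factor of $K$; averaging keeps the output dimensionality fixed, which is why it is preferred at the prediction layer.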


6. Relation to the Transformer

Multi-head graph attention reveals a deep conceptual connection between GNNs and Transformers.

Transformers are graph neural networks

A standard Transformer encoder may be interpreted as a multi-head attentional message-passing architecture operating on a fully connected graph, that is, on a graph in which every token can exchange information with every other token.

However, strict equivalence requires care. Standard Transformers extend the basic GAT mechanism through distinct linear projections for queries, keys, and values:

$$q_i = W_Q h_i, \qquad k_j = W_K h_j, \qquad v_j = W_V h_j,$$

followed by attention scores of the form

$$e_{ij} \;=\; \frac{q_i^\top k_j}{\sqrt{d_k}}.$$

In addition, Transformer encoders rely on positional encodings, residual connections, and normalization layers. What remains common across both architectures is the central principle: message importance is not fixed a priori but learned from pairwise interactions between representations.
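The correspondence is visible in code: a single scaled dot-product head is exactly attentional aggregation where the "neighborhood" of every token is the whole sequence. A minimal NumPy sketch with illustrative random projections (positional encodings, residuals, and normalization omitted):

```python
import numpy as np

rng = np.random.default_rng(6)

def self_attention(H, Wq, Wk, Wv):
    """One Transformer attention head = message passing on a fully connected graph."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv           # distinct query/key/value projections
    dk = K.shape[1]
    E = Q @ K.T / np.sqrt(dk)                  # scaled dot-product scores e_ij
    E = E - E.max(axis=1, keepdims=True)
    alpha = np.exp(E)
    alpha /= alpha.sum(axis=1, keepdims=True)  # softmax over ALL tokens: no mask
    return alpha @ V, alpha

T, d, dk = 5, 4, 3
H = rng.normal(size=(T, d))
out, alpha = self_attention(H,
                            rng.normal(size=(d, dk)),
                            rng.normal(size=(d, dk)),
                            rng.normal(size=(d, dk)))
# every attention weight is strictly positive: each token attends to every token
```

Compared with the GAT sketches above, the only structural difference is the absence of an adjacency mask: the softmax runs over the entire sequence rather than a sparse neighborhood.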

Core takeaway

Graph Attention Networks replace uniform neighborhood aggregation with learned, normalized, feature-dependent weighting. This transforms message passing from an isotropic structural operation into an anisotropic relational inference mechanism.