The /√d_k scaling prevents dot products from getting so large that softmax saturates into a near one-hot. Without it, the dot product of two d_k-dimensional vectors with roughly unit-variance entries has variance proportional to d_k, so larger d_k means larger raw scores; dividing by √d_k keeps the scores in a stable range regardless of model size.
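A quick numerical check of this claim, using random vectors (illustrative only):

```python
import numpy as np

# Sketch: raw dot products spread out as d_k grows; dividing by sqrt(d_k)
# keeps them in the same range at every size. Vectors are random stand-ins.
rng = np.random.default_rng(0)

for d_k in (16, 256, 4096):
    q = rng.standard_normal(d_k)   # one query vector, unit-variance entries
    k = rng.standard_normal(d_k)   # one key vector
    raw = q @ k                    # variance grows roughly like d_k
    scaled = raw / np.sqrt(d_k)    # variance stays ~1 regardless of d_k
    print(f"d_k={d_k:5d}  raw={raw:8.2f}  scaled={scaled:6.2f}")
```

Over many samples, the raw dot products have standard deviation ≈ √d_k while the scaled ones stay near 1, which is exactly what keeps softmax out of its saturated regime.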
Causal masking — in decoder-only models (GPT-style), each token can only attend to itself and earlier tokens. Future positions are masked to −∞ before softmax, so their attention weight becomes zero. The example above works because “sat” only attends to tokens that came before it or at the same position.
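A minimal sketch of the mask in NumPy. The first two score rows are made up; the last row reuses the "sat" scores [0.7, 2.1, 1.3] from this example:

```python
import numpy as np

# Causal masking: future positions get -inf before softmax, so their
# attention weight is exactly 0. First two score rows are hypothetical.
scores = np.array([[1.2, 0.3, 0.8],
                   [0.5, 1.6, 0.9],
                   [0.7, 2.1, 1.3]])   # last row: the "sat" scores from above

mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above diagonal = future
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.round(2))
# row 0 ("the") can only see itself
# row 2 ("sat") sees every position at or before its own,
# matching the alpha weights computed in the example (up to rounding)
```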
Step 3 — softmax to get attention weights
softmax over [0.7, 2.1, 1.3]:
α = [0.15, 0.59, 0.27] ← attention weights, sum=1
Step 4 — weighted sum of Value vectors
output_sat = α_0·V_the + α_1·V_cat + α_2·V_sat
= 0.15·V_the + 0.59·V_cat + 0.27·V_sat
= [ 0.54, -0.11, 0.63, 0.08] ← enriched "sat"
“sat” now carries information about “cat” because it attended to it heavily.
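Steps 3 and 4 as code. The scores are the ones above, but the V vectors are hypothetical stand-ins (the actual values behind V_the, V_cat, V_sat aren't shown), so the printed output vector will differ from the one in the example:

```python
import numpy as np

# Step 3: softmax over the raw scores -> attention weights alpha.
scores = np.array([0.7, 2.1, 1.3])             # score(sat, j) for j in (the, cat, sat)
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax, rows sum to 1
print(alpha)

# Step 4: weighted sum of Value vectors. These V rows are made up.
V = np.array([[0.1,  0.4, -0.2,  0.3],         # V_the (hypothetical)
              [0.8, -0.3,  0.9,  0.1],         # V_cat (hypothetical)
              [0.2,  0.1,  0.4, -0.1]])        # V_sat (hypothetical)

output_sat = alpha @ V                         # alpha_0*V_the + alpha_1*V_cat + alpha_2*V_sat
print(output_sat)                              # enriched "sat", same width as V rows
```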
Real models run many of these attention heads in parallel (commonly 32–64, sometimes more), each with its own W_Q, W_K, W_V; the head outputs are concatenated and projected back to the model dimension.
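A toy sketch of that parallelism, with 2 heads and d_model = 8 instead of real sizes; the final output projection (usually called W_O) is noted in a comment but omitted:

```python
import numpy as np

# Multi-head attention sketch: each head has its own W_Q, W_K, W_V and runs
# the same attention math independently. Sizes are tiny and illustrative.
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 3, 8, 2
d_k = d_model // n_heads                       # per-head dimension

x = rng.standard_normal((n_tokens, d_model))   # one row per token

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    alpha = softmax(Q @ K.T / np.sqrt(d_k))    # (n_tokens, n_tokens) per head
    head_outputs.append(alpha @ V)

out = np.concatenate(head_outputs, axis=-1)    # back to (n_tokens, d_model);
print(out.shape)                               # real models then apply W_O
```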
Terms used here

| Symbol | Meaning |
| --- | --- |
| W_Q, W_K, W_V | learned weight matrices that project x into Q, K, V spaces |
| Q | matrix of query vectors — one row per token, computed as x · W_Q for each x |
| K | matrix of key vectors — one row per token, computed as x · W_K for each x |
| V | matrix of value vectors — one row per token, computed as x · W_V for each x |
| Q_i | query vector of token i (row i of Q) |
| K_j | key vector of token j (row j of K) |
| V_j | value vector of token j (row j of V) |
| d_k | dimension of the Q/K vectors — used for scaling to prevent large dot products |
| score(i,j) | raw attention score: how much token i should look at token j |
| α | attention weights after softmax — how much of each V to mix in |
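The symbols above can be tied together in one minimal single-head attention function. This is a sketch with toy sizes and random weights, not an optimized implementation:

```python
import numpy as np

# Single-head attention using the table's symbols:
# W_Q/W_K/W_V project x into Q/K/V; scores are scaled by sqrt(d_k);
# softmax gives alpha; the output mixes rows of V.
def attention(x, W_Q, W_K, W_V):
    Q = x @ W_Q                                  # one query row per token
    K = x @ W_K                                  # one key row per token
    V = x @ W_V                                  # one value row per token
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # score(i, j) for all pairs
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)   # softmax rows -> weights
    return alpha @ V                             # weighted mix of V rows

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))                  # 3 tokens, d_model = 4 (toy)
W = [rng.standard_normal((4, 4)) for _ in range(3)]
print(attention(x, *W).shape)                    # same shape as x: (3, 4)
```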
02 — Feed-Forward Network (FFN)
After attention, each token vector passes through an FFN independently — no cross-token communication, just transformation.
The FFN is thought to be where much of the model's factual knowledge is stored, with different neurons firing for different concepts, acting roughly like key-value memories.
Typical real dimensions: 4096 → 16384 → 4096 (a 4× expansion), repeated in every layer.
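A minimal FFN sketch under those shapes, with d = 4 standing in for 4096 and the tanh approximation of GELU (the variant used by GPT-2, among others); weights here are random stand-ins:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    h = gelu(x @ W1 + b1)   # expand: d -> 4d, then non-linearity
    return h @ W2 + b2      # contract: 4d -> d

rng = np.random.default_rng(0)
d = 4                                             # stand-in for 4096
W1, b1 = rng.standard_normal((d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)), np.zeros(d)

x = np.array([-0.40, 0.60, 0.20, -0.90])          # the "sat" vector from earlier
print(ffn(x, W1, b1, W2, b2).shape)               # (4,): same shape in, same out
```

Each token passes through this independently, which is why the FFN can be applied to all positions in parallel.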
Terms used here

| Symbol | Meaning |
| --- | --- |
| W₁, b₁ | learned weights and bias for the expand step (shape: d × 4d) |
| W₂, b₂ | learned weights and bias for the contract step (shape: 4d × d) |
| h | intermediate hidden state in the expanded dimension |
| GELU | activation function — introduces non-linearity so stacked linear layers can’t collapse into one |
| 4d | expanded dimension (4× model dimension) — gives the network more “workspace” |
03 — Residual Connection + LayerNorm
Two structural components that stabilize both training and inference. They wrap every attention block and every FFN block:
Attention → Add+Norm → FFN → Add+Norm
Add+Norm after Attention (using “sat” from section 01)
x = [-0.40, 0.60, 0.20, -0.90] ← "sat" input to this layer (from §00)
→ [ -0.05, 1.06, 0.58, -1.59] ← output of this full transformer layer
This output feeds into the next layer’s attention as x, and the process repeats N times.
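A sketch of one Add+Norm step. The sublayer output here is hypothetical (the real attention output isn't shown), and the LayerNorm omits the learned gain and bias for simplicity:

```python
import numpy as np

# Add+Norm: residual add, then LayerNorm (no learned gain/bias in this sketch).
def layer_norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

x = np.array([-0.40, 0.60, 0.20, -0.90])            # "sat" input to the sublayer
sublayer_out = np.array([0.30, 0.50, 0.40, -0.20])  # hypothetical attention output

out = layer_norm(x + sublayer_out)                  # "Add" then "Norm"
print(out)   # re-centered and rescaled: mean ~0, std ~1
```

The residual add means the sublayer only has to learn a correction to x, and the LayerNorm keeps the result in a stable numerical range before it enters the next sublayer.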
LayerNorm vs Softmax — both “normalize” but for different purposes:
softmax turns scores into a probability distribution: all positive, sum to 1. Used when choosing between options (attention weights, token probabilities).
LayerNorm re-centers and rescales a vector: values can still be negative, do not sum to 1. Used to keep activations in a stable numerical range between layers.
softmax([0.7, 2.1, 1.3]) -> [ 0.15, 0.59, 0.27] <- probability distribution
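The contrast is easy to verify by applying both to the same scores (LayerNorm again without learned gain/bias):

```python
import numpy as np

# Same raw scores, both "normalizations".
scores = np.array([0.7, 2.1, 1.3])

softmax_out = np.exp(scores) / np.exp(scores).sum()
layernorm_out = (scores - scores.mean()) / scores.std()  # no learned gain/bias

print(softmax_out)    # all positive, sums to 1: a probability distribution
print(layernorm_out)  # zero mean, unit spread, contains negatives
```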