
Transformer Architecture



first principles · dense walkthrough · toy numbers


00 — Tokenization + Embedding

Text is split into tokens. Each maps to an integer ID, then looked up in an embedding table to get a dense vector.

sentence: "the cat sat"
token IDs: the→4, cat→912, sat→3041
embedding lookup (d=4, toy):
the → [ 0.2, 0.8, -0.1, 0.5]
cat → [ 0.9, -0.3, 0.7, 0.1]
sat → [-0.4, 0.6, 0.2, -0.9]

+ Positional Encoding — transformers have no built-in sense of order. A positional vector is added to each embedding:

the [ 0.2, 0.8, -0.1, 0.5] ← token embed
+ [ 0.0, 1.0, 0.0, 1.0] ← pos(0)
= [ 0.2, 1.8, -0.1, 1.5] ← final input vector for "the"
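
As a sketch, the lookup-and-add can be written in plain Python. The embedding table holds the toy values above; `pos_encoding` implements the sinusoidal scheme from the original Transformer paper, which gives exactly pos(0) = [0, 1, 0, 1] as used in the example:

```python
import math

# Toy embedding table (d=4), values from the example above
embedding = {
    4:    [0.2, 0.8, -0.1, 0.5],   # "the"
    912:  [0.9, -0.3, 0.7, 0.1],   # "cat"
    3041: [-0.4, 0.6, 0.2, -0.9],  # "sat"
}

def pos_encoding(i, d=4):
    # Sinusoidal positional encoding: even dims use sin, odd dims use cos
    return [math.sin(i / 10000 ** (2 * (k // 2) / d)) if k % 2 == 0
            else math.cos(i / 10000 ** (2 * (k // 2) / d))
            for k in range(d)]

def embed(token_ids):
    # embedding lookup + element-wise add of the positional vector
    return [[e + p for e, p in zip(embedding[t], pos_encoding(i))]
            for i, t in enumerate(token_ids)]

x = embed([4, 912, 3041])  # "the cat sat" -> one d=4 vector per token
```

Real models learn the embedding table (and often the positional information too); only the shapes and the add are fixed here.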

Terms used here

Symbol    Meaning
d         embedding dimension (size of each token vector). Toy=4, real=4096
x         a single token’s vector (one row of the embedding table)
pos(i)    positional encoding added to token at position i

01 — Self-Attention

Every token asks: “who should I pay attention to?”

Step 1 — project each token into Q, K, V

For every token in the sequence, multiply its vector x by three learned weight matrices:

Q = x · W_Q ← "what am I looking for?"
K = x · W_K ← "what do I offer to others?"
V = x · W_V ← "what information do I carry?"

This is done for every token, giving us three matrices (one row per token):

Q matrix K matrix V matrix
the → [q vectors] [k vectors] [v vectors]
cat → [q vectors] [k vectors] [v vectors]
sat → [q vectors] [k vectors] [v vectors]

Q_i means: the Q vector of token i (row i of the Q matrix). K_j means: the K vector of token j (row j of the K matrix).
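
A minimal sketch of the Q projection for the toy sequence. The row for "the" comes from section 00; the "cat"/"sat" rows and the W_Q entries are made up for illustration (real weights are learned):

```python
def matmul(X, W):
    # X: list of row vectors, W: matrix as list of rows -> X @ W
    return [[sum(x[k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))]
            for x in X]

X = [[0.2, 1.8, -0.1, 1.5],   # "the" (embedding + position, from section 00)
     [0.9, 0.7, 0.7, 1.1],    # "cat" (illustrative values)
     [-0.4, 1.6, 0.2, 0.1]]   # "sat" (illustrative values)

W_Q = [[0.1, 0.0, 0.2, 0.0],  # made-up 4x4 projection matrix
       [0.0, 0.3, 0.0, 0.1],
       [0.2, 0.0, 0.1, 0.0],
       [0.0, 0.1, 0.0, 0.3]]

Q = matmul(X, W_Q)  # one query row per token; K and V are computed the same way
```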


Step 2 — compute attention scores between every pair of tokens

How much should token i attend to token j? Dot the query of i against the key of j, then scale:

score(i, j) = (Q_i · K_jᵀ) / √d_k
d_k = 4 (toy), so √d_k = √4 = 2
toy example — "sat" (i=2) attending to all tokens (j=0,1,2):
score(sat, the) = Q_sat · K_the / 2 = 1.4 / 2 = 0.70
score(sat, cat) = Q_sat · K_cat / 2 = 4.2 / 2 = 2.10 ← highest
score(sat, sat) = Q_sat · K_sat / 2 = 2.6 / 2 = 1.30

The /√d_k scaling prevents dot products from getting too large and making softmax collapse to a one-hot. Without it, larger d_k means larger raw dot products — the scaling keeps scores in a stable range regardless of model size.

Causal masking — in decoder-only models (GPT-style), each token can only attend to itself and earlier tokens. Future positions are masked to −∞ before softmax, so their attention weight becomes zero. The example above is unaffected: “sat” is the last token, so every position it attends to is at or before its own.
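
Step 2 and the causal mask together, as a sketch. Q_sat and the key vectors below are invented so that the dot products reproduce the article's raw scores (1.4, 4.2, 2.6):

```python
import math

def scores_row(q_i, K, i, d_k=4):
    # scaled dot-product scores for token i against every key,
    # with future positions (j > i) masked to -inf (causal mask)
    out = []
    for j, k_j in enumerate(K):
        if j > i:
            out.append(float("-inf"))
        else:
            out.append(sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k))
    return out

# invented vectors chosen so the dot products match the article's 1.4 / 4.2 / 2.6
Q_sat = [1.0, 1.0, 1.0, 1.0]
K = [[0.5, 0.3, 0.4, 0.2],   # K_the
     [1.0, 1.2, 1.0, 1.0],   # K_cat
     [0.6, 0.8, 0.6, 0.6]]   # K_sat

s = scores_row(Q_sat, K, i=2)  # close to the article's [0.70, 2.10, 1.30]
```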


Step 3 — softmax to get attention weights

softmax over [0.7, 2.1, 1.3]:
α = [0.15, 0.59, 0.27] ← attention weights, sum=1
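
A numerically stable softmax in plain Python. Subtracting the max before exponentiating prevents overflow, and it also sends masked −∞ scores to exactly zero weight:

```python
import math

def softmax(xs):
    # subtract the max for numerical stability; exp(-inf) becomes 0,
    # so masked positions get zero attention weight
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

alpha = softmax([0.7, 2.1, 1.3])  # close to the article's [0.15, 0.59, 0.27]
```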

Step 4 — weighted sum of Value vectors

output_sat = α_0·V_the + α_1·V_cat + α_2·V_sat
= 0.15·V_the + 0.59·V_cat + 0.27·V_sat
= [ 0.54, -0.11, 0.63, 0.08] ← enriched "sat"

“sat” now carries information about “cat” because it attended to it heavily. Real models run 32–64 of these attention heads in parallel, each with its own W_Q, W_K, W_V.
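
Step 4 as a sketch. The attention weights are the ones from step 3; the article does not list V_the, V_cat, V_sat, so the value rows below are illustrative:

```python
def weighted_sum(alpha, V):
    # output for one token: sum of alpha_j * V_j across all tokens j
    return [sum(a * v[k] for a, v in zip(alpha, V))
            for k in range(len(V[0]))]

alpha = [0.15, 0.59, 0.27]     # attention weights from step 3
V = [[0.1, 0.2, 0.3, 0.4],     # V_the (illustrative, not from the article)
     [0.8, -0.2, 0.9, 0.0],    # V_cat
     [0.3, 0.1, 0.2, -0.1]]    # V_sat
out = weighted_sum(alpha, V)   # enriched vector for "sat", d=4
```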

Terms used here

Symbol         Meaning
W_Q, W_K, W_V  learned weight matrices that project x into Q, K, V spaces
Q              matrix of query vectors — one row per token, computed as x · W_Q for each x
K              matrix of key vectors — one row per token, computed as x · W_K for each x
V              matrix of value vectors — one row per token, computed as x · W_V for each x
Q_i            query vector of token i (row i of Q)
K_j            key vector of token j (row j of K)
V_j            value vector of token j (row j of V)
d_k            dimension of the Q/K vectors — used for scaling to prevent large dot products
score(i,j)     raw attention score: how much token i should look at token j
α              attention weights after softmax — how much of each V to mix in

02 — Feed-Forward Network (FFN)

After attention, each token vector passes through an FFN independently — no cross-token communication, just transformation.

input (d=4): [ 0.54, -0.11, 0.63, 0.08]
step 1 — expand (d → 4d = 16):
h = GELU(input · W₁ + b₁)
h = [ 0.0, 1.2, 0.0, 0.7, ...] ← sparse activations
step 2 — contract (4d → d):
output = h · W₂ + b₂
output = [ 0.31, 0.77, -0.22, 0.45]

The FFN is where much of the model’s factual knowledge is thought to be stored — different neurons fire for different concepts. Real dimensions: 4096 → 16384 → 4096, repeated in every layer.
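
The expand → GELU → contract pipeline as a sketch, using the common tanh approximation of GELU. Shapes are commented; any concrete weights you pass in are your own (real ones are learned):

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def ffn(x, W1, b1, W2, b2):
    # x: (d,)  W1: d x 4d  b1: (4d,)  W2: 4d x d  b2: (d,)
    # expand, apply the non-linearity, contract — per token, no cross-token mixing
    h = [gelu(sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    return [sum(hi * W2[i][j] for i, hi in enumerate(h)) + b2[j]
            for j in range(len(b2))]
```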

Terms used here

Symbol    Meaning
W₁, b₁    learned weights and bias for the expand step (shape: d × 4d)
W₂, b₂    learned weights and bias for the contract step (shape: 4d × d)
h         intermediate hidden state in the expanded dimension
GELU      activation function — introduces non-linearity so stacked linear layers can’t collapse into one
4d        expanded dimension (4× model dimension) — gives the network more “workspace”

03 — Residual Connection + LayerNorm

Two structural components that stabilize both training and inference. They wrap every attention block and every FFN block:

Attention → Add+Norm → FFN → Add+Norm

Add+Norm after Attention (using “sat” from section 01)

x = [-0.40, 0.60, 0.20, -0.90] <- "sat" input to this layer (from §00)
attention(x) = [ 0.54, -0.11, 0.63, 0.08] <- attention output (from §01)
residual:
x + attention(x) = [-0.40+0.54, 0.60-0.11, 0.20+0.63, -0.90+0.08]
= [ 0.14, 0.49, 0.83, -0.82]
layernorm:
mean = (0.14 + 0.49 + 0.83 - 0.82) / 4 = 0.16
variance = mean of squared deviations = 0.38
std = sqrt(0.38) ~= 0.62
normalized = (each value - mean) / std:
(0.14 - 0.16) / 0.62 = -0.03
(0.49 - 0.16) / 0.62 = 0.53
(0.83 - 0.16) / 0.62 = 1.08
(-0.82 - 0.16) / 0.62 = -1.58
-> [ -0.03, 0.53, 1.08, -1.58] <- input to FFN

Add+Norm after FFN (continuing from above)

x_ffn_in = [-0.03, 0.53, 1.08, -1.58] <- input to FFN (from above)
ffn(x) = [ 0.31, 0.77, -0.22, 0.45] <- FFN output (from §02)
residual:
x_ffn_in + ffn(x) = [-0.03+0.31, 0.53+0.77, 1.08-0.22, -1.58+0.45]
= [ 0.28, 1.30, 0.86, -1.13]
layernorm:
mean = (0.28 + 1.30 + 0.86 - 1.13) / 4 = 0.33
std ~= 0.92
normalized:
(0.28 - 0.33) / 0.92 = -0.05
(1.30 - 0.33) / 0.92 = 1.06
(0.86 - 0.33) / 0.92 = 0.58
(-1.13 - 0.33) / 0.92 = -1.59
-> [ -0.05, 1.06, 0.58, -1.59] <- output of this full transformer layer

This output feeds into the next layer’s attention as x, and the process repeats N times.
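
Both Add+Norm steps follow the same recipe; here is the first one as a sketch in plain Python (the learned scale/shift γ, β are omitted for clarity):

```python
import math

def layernorm(x, eps=1e-5):
    # re-center and re-scale across the vector's dimensions
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [-0.40, 0.60, 0.20, -0.90]        # "sat" input to the layer (section 00)
attn_out = [0.54, -0.11, 0.63, 0.08]  # attention output (section 01)
residual = [a + b for a, b in zip(x, attn_out)]
y = layernorm(residual)  # close to the article's [-0.03, 0.53, 1.08, -1.58]
```

The small differences from the article's digits come from it rounding the std to 0.62 before dividing.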


LayerNorm vs Softmax — both “normalize” but for different purposes:

  • softmax turns scores into a probability distribution: all positive, sum to 1. Used when choosing between options (attention weights, token probabilities).
  • LayerNorm re-centers and rescales a vector: values can still be negative, do not sum to 1. Used to keep activations in a stable numerical range between layers.
softmax([0.7, 2.1, 1.3]) -> [ 0.15, 0.59, 0.27] <- probability distribution
layernorm([0.7, 2.1, 1.3]) -> [-1.16, 1.28, -0.12] <- same shape, re-centered

Softmax answers “which one?” — LayerNorm answers “keep this well-behaved.”

Terms used here

Symbol        Meaning
x             input to the block — the token vector before attention or FFN
layer(x)      output of the attention or FFN block
x + layer(x)  residual: original input added back before normalizing
mean, std     computed across the 4 dimensions of the vector (not across tokens)
LN            LayerNorm — subtract mean, divide by std, then scale/shift by learned γ, β

04 — LM Head → Logits

After N transformer layers, take the last token’s hidden state. Project it to the size of the full vocabulary:

logits = hidden_state · W_vocab
last hidden state (d=4):
[-0.05, 1.06, 0.58, -1.59]
raw logits (toy vocab of 6):
"." → 2.1
"on" → 3.8 ← highest
"down" → 2.7
"the" → 0.4
"a" → 1.3
"and" → -1.1

W_vocab is often weight-tied to the input embedding matrix (transposed), saving d × vocab_size parameters — on the order of a billion in large models.
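
The projection and weight tying as a sketch; `logits_from_hidden` and `tie_weights` are illustrative names, not a real library API:

```python
def logits_from_hidden(h, W_vocab):
    # h: (d,), W_vocab: d x vocab_size -> one raw score per vocabulary token
    return [sum(h[i] * W_vocab[i][j] for i in range(len(h)))
            for j in range(len(W_vocab[0]))]

def tie_weights(embedding_matrix):
    # weight tying: the unembedding matrix is just the transpose of the
    # input embedding (vocab_size x d -> d x vocab_size)
    return [list(col) for col in zip(*embedding_matrix)]
```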

Terms used here

Symbol        Meaning
hidden_state  final vector for the last token after all N layers (shape: d)
W_vocab       unembedding matrix (shape: d × vocab_size) — one column per vocabulary token
logits        raw unnormalized scores, one per vocabulary token
weight tying  reusing the input embedding matrix as W_vocab (transposed) to reduce parameter count

05 — Softmax + Sampling

Softmax turns raw logits into a probability distribution:

P(token_i) = exp(logit_i) / Σ exp(all logits)
"on" → 0.62 ████████████
"down" → 0.20 ████
"." → 0.11 ██
"a" → 0.05 █
"the" → 0.02 ░

Temperature T — divide logits by T before softmax.

  • T→0 greedy (always pick top) · T=1 normal · T→∞ uniform random

Top-p (nucleus) — keep the smallest set of tokens whose cumulative probability ≥ p. Sample only from those.

→ sampled: "on"
→ continuing: "the cat sat on ___"
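
Softmax, temperature, and top-p together as a sketch (`sample` is an illustrative helper, not a real library function):

```python
import math
import random

def softmax(xs):
    # stable softmax: subtract the max before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample(logits, vocab, temperature=1.0, top_p=1.0, rng=random):
    # temperature reshapes the distribution; top-p restricts the candidate set
    probs = softmax([l / temperature for l in logits])
    ranked = sorted(zip(vocab, probs), key=lambda pair: -pair[1])
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:   # smallest set with cumulative probability >= p
            break
    r = rng.random() * total  # sample within the kept (renormalized) mass
    for tok, p in kept:
        r -= p
        if r <= 0:
            return tok
    return kept[-1][0]
```

With a very low temperature this reduces to greedy decoding, and with a small top_p only the head of the distribution stays eligible.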

Terms used here

Symbol           Meaning
softmax          converts any vector of numbers into a probability distribution summing to 1
exp(x)           e^x — amplifies differences between logits, making high scores dominate
T (temperature)  scalar that controls distribution sharpness. Divide logits by T before softmax
top-p            nucleus sampling — restrict candidates to smallest set covering cumulative probability p

∑ Full Picture

"the cat sat"
Token IDs → Embedding (x) + Positional Encoding pos(i)
▼ × N layers (e.g. 32)
┌───────────────────────────────────────┐
│ Self-Attention │
│ Q=x·W_Q, K=x·W_K, V=x·W_V │
│ scores = Q·Kᵀ / √d_k │
│ α = softmax(scores) │
│ out = α · V │
│ Add + LayerNorm │
│ FFN: GELU(x·W₁+b₁)·W₂+b₂ │
│ Add + LayerNorm │
└───────────────────────────────────────┘
Last token hidden_state
logits = hidden_state · W_vocab
P = softmax(logits / T)
Sample → "on" → append → repeat