
Transformer Architecture



first principles · dense walkthrough · toy numbers


00 — Tokenization + Embedding

Text is split into tokens. Each maps to an integer ID, then looked up in an embedding table to get a dense vector.

sentence: "the cat sat"
token IDs: the→4, cat→912, sat→3041
embedding lookup (d=4, toy):
the → [ 0.2, 0.8, -0.1, 0.5]
cat → [ 0.9, -0.3, 0.7, 0.1]
sat → [-0.4, 0.6, 0.2, -0.9]

+ Positional Encoding — transformers have no built-in sense of order. A positional vector is added to each embedding:

the [ 0.2, 0.8, -0.1, 0.5] ← token embed
+ [ 0.0, 1.0, 0.0, 1.0] ← pos(0)
= [ 0.2, 1.8, -0.1, 1.5] ← final input vector for "the"
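
As a sketch, the lookup-and-add can be written in plain Python. The embedding table holds the toy values above; `pos_encoding` implements the sinusoidal scheme from the original Transformer paper, which gives exactly pos(0) = [0, 1, 0, 1] as used in the example:

```python
import math

# Toy embedding table (d=4), values from the example above
embedding = {
    4:    [0.2, 0.8, -0.1, 0.5],   # "the"
    912:  [0.9, -0.3, 0.7, 0.1],   # "cat"
    3041: [-0.4, 0.6, 0.2, -0.9],  # "sat"
}

def pos_encoding(i, d=4):
    # Sinusoidal positional encoding: even dims use sin, odd dims use cos
    return [math.sin(i / 10000 ** (2 * (k // 2) / d)) if k % 2 == 0
            else math.cos(i / 10000 ** (2 * (k // 2) / d))
            for k in range(d)]

def embed(token_ids):
    # embedding lookup + element-wise add of the positional vector
    return [[e + p for e, p in zip(embedding[t], pos_encoding(i))]
            for i, t in enumerate(token_ids)]

x = embed([4, 912, 3041])  # "the cat sat" -> one d=4 vector per token
```

Real models learn the embedding table (and often the positional information too); only the shapes and the add are fixed here.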

Terms used here

Symbol    Meaning
d         embedding dimension (size of each token vector). Toy=4, real=4096
x         a single token’s vector (one row of the embedding table)
pos(i)    positional encoding added to token at position i

01 — Self-Attention

Every token asks: “who should I pay attention to?”

Step 1 — project each token into Q, K, V

For every token in the sequence, multiply its vector x by three learned weight matrices:

Q = x · W_Q ← "what am I looking for?"
K = x · W_K ← "what do I offer to others?"
V = x · W_V ← "what information do I carry?"

This is done for every token, giving us three matrices (one row per token):

Q matrix K matrix V matrix
the → [q vectors] [k vectors] [v vectors]
cat → [q vectors] [k vectors] [v vectors]
sat → [q vectors] [k vectors] [v vectors]

Q_i means: the Q vector of token i (row i of the Q matrix). K_j means: the K vector of token j (row j of the K matrix).
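
A minimal sketch of the Q projection for the toy sequence. The row for "the" comes from section 00; the "cat"/"sat" rows and the W_Q entries are made up for illustration (real weights are learned):

```python
def matmul(X, W):
    # X: list of row vectors, W: matrix as list of rows -> X @ W
    return [[sum(x[k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))]
            for x in X]

X = [[0.2, 1.8, -0.1, 1.5],   # "the" (embedding + position, from section 00)
     [0.9, 0.7, 0.7, 1.1],    # "cat" (illustrative values)
     [-0.4, 1.6, 0.2, 0.1]]   # "sat" (illustrative values)

W_Q = [[0.1, 0.0, 0.2, 0.0],  # made-up 4x4 projection matrix
       [0.0, 0.3, 0.0, 0.1],
       [0.2, 0.0, 0.1, 0.0],
       [0.0, 0.1, 0.0, 0.3]]

Q = matmul(X, W_Q)  # one query row per token; K and V are computed the same way
```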


Step 2 — compute attention scores between every pair of tokens

How much should token i attend to token j? Dot the query of i against the key of j, then scale:

score(i, j) = (Q_i · K_jᵀ) / √d_k
d_k = 4 (toy), so √d_k = √4 = 2
toy example — "sat" (i=2) attending to all tokens (j=0,1,2):
score(sat, the) = Q_sat · K_the / 2 = 1.4 / 2 = 0.70
score(sat, cat) = Q_sat · K_cat / 2 = 4.2 / 2 = 2.10 ← highest
score(sat, sat) = Q_sat · K_sat / 2 = 2.6 / 2 = 1.30

The /√d_k scaling prevents dot products from getting too large and making softmax collapse to a one-hot. Without it, larger d_k means larger raw dot products — the scaling keeps scores in a stable range regardless of model size.

Causal masking — in decoder-only models (GPT-style), each token can only attend to itself and earlier tokens. Future positions are masked to −∞ before softmax, so their attention weight becomes zero. The example above is unaffected: “sat” is the last token, so every position it attends to is at or before its own.
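
Step 2 and the causal mask together, as a sketch. Q_sat and the key vectors below are invented so that the dot products reproduce the article's raw scores (1.4, 4.2, 2.6):

```python
import math

def scores_row(q_i, K, i, d_k=4):
    # scaled dot-product scores for token i against every key,
    # with future positions (j > i) masked to -inf (causal mask)
    out = []
    for j, k_j in enumerate(K):
        if j > i:
            out.append(float("-inf"))
        else:
            out.append(sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k))
    return out

# invented vectors chosen so the dot products match the article's 1.4 / 4.2 / 2.6
Q_sat = [1.0, 1.0, 1.0, 1.0]
K = [[0.5, 0.3, 0.4, 0.2],   # K_the
     [1.0, 1.2, 1.0, 1.0],   # K_cat
     [0.6, 0.8, 0.6, 0.6]]   # K_sat

s = scores_row(Q_sat, K, i=2)  # close to the article's [0.70, 2.10, 1.30]
```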


Step 3 — softmax to get attention weights

softmax over [0.7, 2.1, 1.3]:
α = [0.15, 0.59, 0.27] ← attention weights, sum=1
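
A numerically stable softmax in plain Python. Subtracting the max before exponentiating prevents overflow, and it also sends masked −∞ scores to exactly zero weight:

```python
import math

def softmax(xs):
    # subtract the max for numerical stability; exp(-inf) becomes 0,
    # so masked positions get zero attention weight
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

alpha = softmax([0.7, 2.1, 1.3])  # close to the article's [0.15, 0.59, 0.27]
```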

Step 4 — weighted sum of Value vectors

output_sat = α_0·V_the + α_1·V_cat + α_2·V_sat
= 0.15·V_the + 0.59·V_cat + 0.27·V_sat
= [ 0.54, -0.11, 0.63, 0.08] ← enriched "sat"

“sat” now carries information about “cat” because it attended to it heavily. Real models run 32–64 of these attention heads in parallel, each with its own W_Q, W_K, W_V.
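
Step 4 as a sketch. The attention weights are the ones from step 3; the article does not list V_the, V_cat, V_sat, so the value rows below are illustrative:

```python
def weighted_sum(alpha, V):
    # output for one token: sum of alpha_j * V_j across all tokens j
    return [sum(a * v[k] for a, v in zip(alpha, V))
            for k in range(len(V[0]))]

alpha = [0.15, 0.59, 0.27]     # attention weights from step 3
V = [[0.1, 0.2, 0.3, 0.4],     # V_the (illustrative, not from the article)
     [0.8, -0.2, 0.9, 0.0],    # V_cat
     [0.3, 0.1, 0.2, -0.1]]    # V_sat
out = weighted_sum(alpha, V)   # enriched vector for "sat", d=4
```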

Terms used here

Symbol         Meaning
W_Q, W_K, W_V  learned weight matrices that project x into Q, K, V spaces
Q              matrix of query vectors — one row per token, computed as x · W_Q for each x
K              matrix of key vectors — one row per token, computed as x · W_K for each x
V              matrix of value vectors — one row per token, computed as x · W_V for each x
Q_i            query vector of token i (row i of Q)
K_j            key vector of token j (row j of K)
V_j            value vector of token j (row j of V)
d_k            dimension of the Q/K vectors — used for scaling to prevent large dot products
score(i,j)     raw attention score: how much token i should look at token j
α              attention weights after softmax — how much of each V to mix in

02 — Feed-Forward Network (FFN)

After attention, each token vector passes through an FFN independently — no cross-token communication, just transformation.

input (d=4): [ 0.54, -0.11, 0.63, 0.08]
step 1 — expand (d → 4d = 16):
h = GELU(input · W₁ + b₁)
h = [ 0.0, 1.2, 0.0, 0.7, ...] ← sparse activations
step 2 — contract (4d → d):
output = h · W₂ + b₂
output = [ 0.31, 0.77, -0.22, 0.45]

The FFN is where much of the model’s factual knowledge is thought to be stored — different neurons fire for different concepts. Real dimensions: 4096 → 16384 → 4096, repeated in every layer.
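
The expand → GELU → contract pipeline as a sketch, using the common tanh approximation of GELU. Shapes are commented; any concrete weights you pass in are your own (real ones are learned):

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def ffn(x, W1, b1, W2, b2):
    # x: (d,)  W1: d x 4d  b1: (4d,)  W2: 4d x d  b2: (d,)
    # expand, apply the non-linearity, contract — per token, no cross-token mixing
    h = [gelu(sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    return [sum(hi * W2[i][j] for i, hi in enumerate(h)) + b2[j]
            for j in range(len(b2))]
```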

Terms used here

Symbol    Meaning
W₁, b₁    learned weights and bias for the expand step (shape: d × 4d)
W₂, b₂    learned weights and bias for the contract step (shape: 4d × d)
h         intermediate hidden state in the expanded dimension
GELU      activation function — introduces non-linearity so stacked linear layers can’t collapse into one
4d        expanded dimension (4× model dimension) — gives the network more “workspace”

03 — Residual Connection + LayerNorm

Two structural components that stabilize both training and inference. They wrap every attention block and every FFN block:

Attention → Add+Norm → FFN → Add+Norm

Add+Norm after Attention (using “sat” from section 01)

x = [-0.40, 0.60, 0.20, -0.90] <- "sat" input to this layer (from §00)
attention(x) = [ 0.54, -0.11, 0.63, 0.08] <- attention output (from §01)
residual:
x + attention(x) = [-0.40+0.54, 0.60-0.11, 0.20+0.63, -0.90+0.08]
= [ 0.14, 0.49, 0.83, -0.82]
layernorm:
mean = (0.14 + 0.49 + 0.83 - 0.82) / 4 = 0.16
variance = mean of squared deviations = 0.38
std = sqrt(0.38) ~= 0.62
normalized = (each value - mean) / std:
(0.14 - 0.16) / 0.62 = -0.03
(0.49 - 0.16) / 0.62 = 0.53
(0.83 - 0.16) / 0.62 = 1.08
(-0.82 - 0.16) / 0.62 = -1.58
-> [ -0.03, 0.53, 1.08, -1.58] <- input to FFN

Add+Norm after FFN (continuing from above)

x_ffn_in = [-0.03, 0.53, 1.08, -1.58] <- input to FFN (from above)
ffn(x) = [ 0.31, 0.77, -0.22, 0.45] <- FFN output (from §02)
residual:
x_ffn_in + ffn(x) = [-0.03+0.31, 0.53+0.77, 1.08-0.22, -1.58+0.45]
= [ 0.28, 1.30, 0.86, -1.13]
layernorm:
mean = (0.28 + 1.30 + 0.86 - 1.13) / 4 = 0.33
std ~= 0.92
normalized:
(0.28 - 0.33) / 0.92 = -0.05
(1.30 - 0.33) / 0.92 = 1.06
(0.86 - 0.33) / 0.92 = 0.58
(-1.13 - 0.33) / 0.92 = -1.59
-> [ -0.05, 1.06, 0.58, -1.59] <- output of this full transformer layer

This output feeds into the next layer’s attention as x, and the process repeats N times.
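
Both Add+Norm steps follow the same recipe; here is the first one as a sketch in plain Python (the learned scale/shift γ, β are omitted for clarity):

```python
import math

def layernorm(x, eps=1e-5):
    # re-center and re-scale across the vector's dimensions
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [-0.40, 0.60, 0.20, -0.90]        # "sat" input to the layer (section 00)
attn_out = [0.54, -0.11, 0.63, 0.08]  # attention output (section 01)
residual = [a + b for a, b in zip(x, attn_out)]
y = layernorm(residual)  # close to the article's [-0.03, 0.53, 1.08, -1.58]
```

The small differences from the article's digits come from it rounding the std to 0.62 before dividing.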


LayerNorm vs Softmax — both “normalize” but for different purposes:

  • softmax turns scores into a probability distribution: all positive, sum to 1. Used when choosing between options (attention weights, token probabilities).
  • LayerNorm re-centers and rescales a vector: values can still be negative, do not sum to 1. Used to keep activations in a stable numerical range between layers.
softmax([0.7, 2.1, 1.3]) -> [ 0.15, 0.59, 0.27] <- probability distribution
layernorm([0.7, 2.1, 1.3]) -> [-1.16, 1.28, -0.12] <- same shape, re-centered

Softmax answers “which one?” — LayerNorm answers “keep this well-behaved.”

Terms used here

Symbol        Meaning
x             input to the block — the token vector before attention or FFN
layer(x)      output of the attention or FFN block
x + layer(x)  residual: original input added back before normalizing
mean, std     computed across the 4 dimensions of the vector (not across tokens)
LN            LayerNorm — subtract mean, divide by std, then scale/shift by learned γ, β

04 — LM Head → Logits

After N transformer layers, take the last token’s hidden state. Project it to the size of the full vocabulary:

logits = hidden_state · W_vocab
last hidden state (d=4):
[-0.05, 1.06, 0.58, -1.59]
raw logits (toy vocab of 6):
"." → 2.1
"on" → 3.8 ← highest
"down" → 2.7
"the" → 0.4
"a" → 1.3
"and" → -1.1

W_vocab is often weight-tied to the input embedding matrix (transposed), saving d × vocab_size parameters — on the order of a billion in large models.
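
The projection and weight tying as a sketch; `logits_from_hidden` and `tie_weights` are illustrative names, not a real library API:

```python
def logits_from_hidden(h, W_vocab):
    # h: (d,), W_vocab: d x vocab_size -> one raw score per vocabulary token
    return [sum(h[i] * W_vocab[i][j] for i in range(len(h)))
            for j in range(len(W_vocab[0]))]

def tie_weights(embedding_matrix):
    # weight tying: the unembedding matrix is just the transpose of the
    # input embedding (vocab_size x d -> d x vocab_size)
    return [list(col) for col in zip(*embedding_matrix)]
```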

Terms used here

Symbol        Meaning
hidden_state  final vector for the last token after all N layers (shape: d)
W_vocab       unembedding matrix (shape: d × vocab_size) — one column per vocabulary token
logits        raw unnormalized scores, one per vocabulary token
weight tying  reusing the input embedding matrix as W_vocab (transposed) to reduce parameter count

05 — Softmax + Sampling

Softmax turns raw logits into a probability distribution:

P(token_i) = exp(logit_i) / Σ exp(all logits)
"on" → 0.62 ████████████
"down" → 0.20 ████
"." → 0.11 ██
"a" → 0.05 █
"the" → 0.02 ░

Temperature T — divide logits by T before softmax.

  • T→0 greedy (always pick top) · T=1 normal · T→∞ uniform random

Top-p (nucleus) — keep the smallest set of tokens whose cumulative probability ≥ p. Sample only from those.

→ sampled: "on"
→ continuing: "the cat sat on ___"
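
Softmax, temperature, and top-p together as a sketch (`sample` is an illustrative helper, not a real library function):

```python
import math
import random

def softmax(xs):
    # stable softmax: subtract the max before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample(logits, vocab, temperature=1.0, top_p=1.0, rng=random):
    # temperature reshapes the distribution; top-p restricts the candidate set
    probs = softmax([l / temperature for l in logits])
    ranked = sorted(zip(vocab, probs), key=lambda pair: -pair[1])
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:   # smallest set with cumulative probability >= p
            break
    r = rng.random() * total  # sample within the kept (renormalized) mass
    for tok, p in kept:
        r -= p
        if r <= 0:
            return tok
    return kept[-1][0]
```

With a very low temperature this reduces to greedy decoding, and with a small top_p only the head of the distribution stays eligible.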

Terms used here

Symbol           Meaning
softmax          converts any vector of numbers into a probability distribution summing to 1
exp(x)           e^x — amplifies differences between logits, making high scores dominate
T (temperature)  scalar that controls distribution sharpness. Divide logits by T before softmax
top-p            nucleus sampling — restrict candidates to smallest set covering cumulative probability p

∑ Full Picture

"the cat sat"
Token IDs → Embedding (x) + Positional Encoding pos(i)
▼ × N layers (e.g. 32)
┌───────────────────────────────────────┐
│ Self-Attention │
│ Q=x·W_Q, K=x·W_K, V=x·W_V │
│ scores = Q·Kᵀ / √d_k │
│ α = softmax(scores) │
│ out = α · V │
│ Add + LayerNorm │
│ FFN: GELU(x·W₁+b₁)·W₂+b₂ │
│ Add + LayerNorm │
└───────────────────────────────────────┘
Last token hidden_state
logits = hidden_state · W_vocab
P = softmax(logits / T)
Sample → "on" → append → repeat