Understanding How GPT Predicts the Next Word: A Step-by-Step Walkthrough
In this post, we dive into the mechanics of how a GPT-like transformer model predicts the next word in a sequence. We’ll walk through a simplified version with small vectors and matrices to see the whole flow: embedding the words, passing them through an attention layer and a feedforward network, and producing the final probability distribution over the vocabulary.
Instead of just one word, we’ll use the two-token input “I love” to explore attention across multiple tokens.
1. Input Embeddings
Let’s say our input sequence is “I love”.
We represent the tokens using 4-dimensional embeddings:
"I" → x1 = [0.1, 0.3, -0.5, 0.7]
"love" → x2 = [0.2, -0.4, 0.6, 0.8]
We will pass both tokens through the transformer layers.
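If you’d like to follow along in code, here is a minimal NumPy sketch of this setup (the library choice and variable names are just for illustration; the embedding values are the toy numbers above, not learned ones):

```python
import numpy as np

# Toy 4-dimensional embeddings for the two input tokens
x1 = np.array([0.1, 0.3, -0.5, 0.7])   # "I"
x2 = np.array([0.2, -0.4, 0.6, 0.8])   # "love"

# Stack them into a (sequence_length, d_model) = (2, 4) matrix
X = np.vstack([x1, x2])
print(X.shape)  # (2, 4)
```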
2. Transformer Layer Weights
We will use small 4×4 matrices for clarity.
Weight matrices:
Wq (Query):
[[ 0.1, 0.2, 0.1, 0.0],
[ 0.0, 0.1, 0.3, 0.1],
[ 0.2, -0.1, 0.1, 0.1],
[ 0.0, 0.2, 0.1, -0.2]]
Wk (Key):
[[ 0.1, 0.0, -0.2, 0.3],
[-0.1, 0.2, 0.1, 0.0],
[ 0.0, 0.3, -0.1, 0.2],
[ 0.1, 0.1, 0.0, 0.1]]
Wv (Value):
[[ 0.3, 0.0, 0.1, -0.1],
[ 0.0, -0.2, 0.2, 0.1],
[ 0.1, 0.2, 0.0, 0.2],
[-0.2, 0.1, 0.1, 0.0]]
3. Q, K, V Computations for Both Tokens
Let’s walk through the actual matrix multiplication for x1 · Wq step by step.
For token “I” (x1 = [0.1, 0.3, -0.5, 0.7]):
Wq:
[[ 0.1, 0.2, 0.1, 0.0],
[ 0.0, 0.1, 0.3, 0.1],
[ 0.2, -0.1, 0.1, 0.1],
[ 0.0, 0.2, 0.1, -0.2]]
Q1 = x1 · Wq:
Q1[0] = 0.1×0.1 + 0.3×0.0 + (-0.5)×0.2 + 0.7×0.0 = 0.01 + 0 - 0.10 + 0 = -0.09
Q1[1] = 0.1×0.2 + 0.3×0.1 + (-0.5)×(-0.1) + 0.7×0.2 = 0.02 + 0.03 + 0.05 + 0.14 = 0.24
Q1[2] = 0.1×0.1 + 0.3×0.3 + (-0.5)×0.1 + 0.7×0.1 = 0.01 + 0.09 - 0.05 + 0.07 = 0.12
Q1[3] = 0.1×0.0 + 0.3×0.1 + (-0.5)×0.1 + 0.7×(-0.2) = 0 + 0.03 - 0.05 - 0.14 = -0.16
So:
Q1 = [-0.09, 0.24, 0.12, -0.16]
(Similar steps give K1 and V1, and the Q, K, V vectors for token “love”.)
We compute Q, K, V for each token:
For token “I”:
Q1 = x1 · Wq = [-0.09, 0.24, 0.12, -0.16]
K1 = x1 · Wk = [0.05, -0.02, 0.06, 0.00]
V1 = x1 · Wv = [-0.16, -0.09, 0.14, -0.08]
For token “love”:
Q2 = x2 · Wq = [0.14, 0.10, 0.04, -0.14]
K2 = x2 · Wk = [0.14, 0.18, -0.14, 0.26]
V2 = x2 · Wv = [-0.04, 0.28, 0.02, 0.06]
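Here is a small, self-contained NumPy sketch that reproduces these projections; the matrix names follow the post, and the commented values match the hand calculations up to floating-point rounding:

```python
import numpy as np

X = np.array([[0.1,  0.3, -0.5, 0.7],    # "I"
              [0.2, -0.4,  0.6, 0.8]])   # "love"

Wq = np.array([[0.1,  0.2, 0.1,  0.0],
               [0.0,  0.1, 0.3,  0.1],
               [0.2, -0.1, 0.1,  0.1],
               [0.0,  0.2, 0.1, -0.2]])
Wk = np.array([[ 0.1, 0.0, -0.2, 0.3],
               [-0.1, 0.2,  0.1, 0.0],
               [ 0.0, 0.3, -0.1, 0.2],
               [ 0.1, 0.1,  0.0, 0.1]])
Wv = np.array([[ 0.3,  0.0, 0.1, -0.1],
               [ 0.0, -0.2, 0.2,  0.1],
               [ 0.1,  0.2, 0.0,  0.2],
               [-0.2,  0.1, 0.1,  0.0]])

# Each row corresponds to one token
Q = X @ Wq   # ≈ [[-0.09, 0.24, 0.12, -0.16], [0.14, 0.10, 0.04, -0.14]]
K = X @ Wk   # ≈ [[ 0.05, -0.02, 0.06, 0.00], [0.14, 0.18, -0.14, 0.26]]
V = X @ Wv   # ≈ [[-0.16, -0.09, 0.14, -0.08], [-0.04, 0.28, 0.02, 0.06]]
```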
4. Attention Weights
We compute the dot product of each token’s Q vector with every K vector. (A real transformer also divides these scores by √d_k before the softmax and applies a causal mask so a token cannot attend to later positions; we skip both here to keep the arithmetic simple.)
Attention scores for token 1 (“I”):
Let’s compute score_11 = Q1 · K1 using:
Q1 = [-0.09, 0.24, 0.12, -0.16]
K1 = [ 0.05, -0.02, 0.06, 0.00]
Dot product:
score_11 = (-0.09×0.05) + (0.24×-0.02) + (0.12×0.06) + (-0.16×0.00)
= -0.0045 - 0.0048 + 0.0072 + 0 = -0.0021
The remaining scores are computed the same way:
score_11 = Q1 · K1 = -0.0021
score_12 = Q1 · K2 = -0.0278
Attention scores for token 2 (“love”):
score_21 = Q2 · K1 = 0.0074
score_22 = Q2 · K2 = -0.0044
Softmax:
softmax([score_11, score_12]) ≈ [0.5064, 0.4936]
softmax([score_21, score_22]) ≈ [0.5029, 0.4971]
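The same scores and weights can be checked with a short, self-contained snippet (Q and K are entered as the values computed in step 3; scaling and masking are omitted, exactly as in the hand calculation):

```python
import numpy as np

# Q and K rows for "I" and "love", from step 3
Q = np.array([[-0.09, 0.24,  0.12, -0.16],
              [ 0.14, 0.10,  0.04, -0.14]])
K = np.array([[ 0.05, -0.02, 0.06,  0.00],
              [ 0.14, 0.18, -0.14,  0.26]])

# Unscaled attention scores: scores[i, j] = Q_i · K_j
scores = Q @ K.T   # ≈ [[-0.0021, -0.0278], [0.0074, -0.0044]]

# Row-wise softmax turns each row of scores into weights that sum to 1
exp_s = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = exp_s / exp_s.sum(axis=1, keepdims=True)
# weights ≈ [[0.506, 0.494],
#            [0.503, 0.497]]
```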
5. Attention Outputs (Weighted V Sum)
Let’s understand how attention weights are used to combine V1 and V2.
Each V vector is 4-dimensional, representing the transformed content at each token.
V1 = [-0.16, -0.09, 0.14, -0.08]
V2 = [-0.04, 0.28, 0.02, 0.06]
Attention weights for token 1: [0.5064, 0.4936]
Compute:
attn1[0] = 0.5064×(-0.16) + 0.4936×(-0.04) = -0.0810 + (-0.0197) ≈ -0.1008
attn1[1] = 0.5064×(-0.09) + 0.4936×0.28 = -0.0456 + 0.1382 ≈ 0.0926
attn1[2] = 0.5064×0.14 + 0.4936×0.02 = 0.0709 + 0.0099 ≈ 0.0808
attn1[3] = 0.5064×(-0.08) + 0.4936×0.06 = -0.0405 + 0.0296 ≈ -0.0109
So:
attn1 ≈ [-0.101, 0.093, 0.081, -0.011]
And similarly for attn2, using weights [0.5029, 0.4971].
For token “I”:
attn1 = 0.5064×V1 + 0.4936×V2 ≈ [-0.101, 0.093, 0.081, -0.011]
For token “love”:
attn2 = 0.5029×V1 + 0.4971×V2 ≈ [-0.100, 0.094, 0.080, -0.010]
Notice that both are nearly identical due to similar attention weights.
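In code, this weighted sum is just one more matrix multiplication. A self-contained sketch using the rounded attention weights from above:

```python
import numpy as np

# Attention weights (rows: query token, columns: key/value token)
weights = np.array([[0.5064, 0.4936],    # token "I"
                    [0.5029, 0.4971]])   # token "love"

# Value vectors for "I" and "love", from step 3
V = np.array([[-0.16, -0.09, 0.14, -0.08],
              [-0.04,  0.28, 0.02,  0.06]])

# Each output row is a weighted average of the rows of V
attn_out = weights @ V
# attn_out ≈ [[-0.101, 0.093, 0.081, -0.011],
#             [-0.100, 0.094, 0.080, -0.010]]
```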
6. Feedforward Network (FFN)
The FFN uses two 4×4 weight matrices:
W1 = [[ 0.2, -0.1, 0.0, 0.3],
[ 0.0, 0.1, 0.2, 0.1],
[ 0.1, 0.0, -0.1, 0.2],
[-0.2, 0.3, 0.2, -0.1]]
W2 = [[ 0.1, 0.3, -0.2, 0.0],
[ 0.2, -0.1, 0.1, 0.3],
[-0.3, 0.1, 0.2, 0.1],
[ 0.0, -0.2, 0.1, 0.2]]
We apply the FFN to each token’s attention output. Only the last token (“love”) matters for the next-word prediction, so using attn2 ≈ [-0.100, 0.094, 0.080, -0.010]:
Hidden = ReLU(attn2 · W1) ≈ [0, 0.0163, 0.0087, 0]
Output = Hidden · W2 ≈ [0.0007, -0.0008, 0.0034, 0.0058]
(ReLU zeroes out the two negative entries of attn2 · W1.)
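Here is the FFN step as a self-contained sketch (no bias terms, matching the simplified setup here; a real transformer block would also wrap attention and the FFN with residual connections and layer normalization):

```python
import numpy as np

W1 = np.array([[ 0.2, -0.1,  0.0,  0.3],
               [ 0.0,  0.1,  0.2,  0.1],
               [ 0.1,  0.0, -0.1,  0.2],
               [-0.2,  0.3,  0.2, -0.1]])
W2 = np.array([[ 0.1,  0.3, -0.2,  0.0],
               [ 0.2, -0.1,  0.1,  0.3],
               [-0.3,  0.1,  0.2,  0.1],
               [ 0.0, -0.2,  0.1,  0.2]])

# Attention output at the last token ("love"), from step 5
attn2 = np.array([-0.1003, 0.0939, 0.0803, -0.0104])

hidden = np.maximum(attn2 @ W1, 0.0)   # ReLU, ≈ [0, 0.0163, 0.0087, 0]
output = hidden @ W2                   # ≈ [0.0007, -0.0008, 0.0034, 0.0058]
```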
7. Final Prediction for Next Word
We now take the last output (from token “love”) and compare it with vocabulary embeddings:
Vocab embeddings:
"you" → [ 0.3, 0.2, -0.1, 0.0]
"pizza" → [-0.2, 0.1, 0.4, 0.1]
"me" → [ 0.0, -0.3, 0.2, 0.3]
Dot products with Output ≈ [0.0007, -0.0008, 0.0034, 0.0058]:
logit("you") = 0.0007×0.3 + (-0.0008)×0.2 + 0.0034×(-0.1) + 0.0058×0.0 ≈ -0.0003
logit("pizza") = 0.0007×(-0.2) + (-0.0008)×0.1 + 0.0034×0.4 + 0.0058×0.1 ≈ 0.0017
logit("me") = 0.0007×0.0 + (-0.0008)×(-0.3) + 0.0034×0.2 + 0.0058×0.3 ≈ 0.0027
Softmax:
probs ≈ [33.3%, 33.3%, 33.4%] (nearly uniform, because these made-up weights produce tiny logits)
✅ Prediction:
Next word = "me" (highest logit, and therefore highest probability; a trained model would produce a much sharper distribution)
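The same scoring step as a self-contained sketch (the three-word vocabulary is the toy one above; a real GPT scores tens of thousands of tokens with a learned output projection):

```python
import numpy as np

# Final output vector at the last token ("love"), from step 6
output = np.array([0.0007, -0.0008, 0.0034, 0.0058])

vocab = {
    "you":   np.array([ 0.3,  0.2, -0.1, 0.0]),
    "pizza": np.array([-0.2,  0.1,  0.4, 0.1]),
    "me":    np.array([ 0.0, -0.3,  0.2, 0.3]),
}

words = list(vocab)
logits = np.array([output @ vocab[w] for w in words])   # ≈ [-0.0003, 0.0017, 0.0027]
probs = np.exp(logits) / np.exp(logits).sum()           # ≈ [0.333, 0.333, 0.334]
print(words[int(np.argmax(probs))])                     # "me"
```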
🔁 Recap
In simple terms:
- Each input word is converted into a vector (like a number representation of meaning).
- The model learns to focus on relevant words using attention — it compares how related each word is to others.
- It mixes the information together and passes it through layers of math (like matrix multiplication and activation functions).
- Finally, it compares the result with all possible words it knows and picks the one that best fits as the next word.
This is like asking: “Given ‘I love’, what’s the most likely next word?” — and the model answers by calculating which word fits best based on what it has learned from billions of sentences.
At each transformer layer:
- Compute Q, K, V for all tokens
- Compute attention scores between tokens (Q·Kᵀ)
- Softmax to get attention weights
- Weighted sum of V vectors (Attention output)
- Pass through FFN (W1 → ReLU → W2)
- Use final output to compute dot product with vocab embeddings
- Apply softmax → pick highest probability token
This example used just 2 tokens, 4-dimensional vectors, and a single transformer layer. GPT-3, by comparison, uses 96 transformer layers and 12,288-dimensional embeddings, along with multi-head attention, positional embeddings, residual connections, and layer normalization, which we skipped here.
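Finally, here is a compact end-to-end version of the whole walkthrough in NumPy. It is a sketch of this toy example only (single head, no √d_k scaling, no causal mask, no residuals or layer norm), not a real GPT implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1. Token embeddings for "I love"
X = np.array([[0.1, 0.3, -0.5, 0.7],
              [0.2, -0.4, 0.6, 0.8]])

# 2. Layer weights (same toy values as above)
Wq = np.array([[0.1, 0.2, 0.1, 0.0], [0.0, 0.1, 0.3, 0.1],
               [0.2, -0.1, 0.1, 0.1], [0.0, 0.2, 0.1, -0.2]])
Wk = np.array([[0.1, 0.0, -0.2, 0.3], [-0.1, 0.2, 0.1, 0.0],
               [0.0, 0.3, -0.1, 0.2], [0.1, 0.1, 0.0, 0.1]])
Wv = np.array([[0.3, 0.0, 0.1, -0.1], [0.0, -0.2, 0.2, 0.1],
               [0.1, 0.2, 0.0, 0.2], [-0.2, 0.1, 0.1, 0.0]])
W1 = np.array([[0.2, -0.1, 0.0, 0.3], [0.0, 0.1, 0.2, 0.1],
               [0.1, 0.0, -0.1, 0.2], [-0.2, 0.3, 0.2, -0.1]])
W2 = np.array([[0.1, 0.3, -0.2, 0.0], [0.2, -0.1, 0.1, 0.3],
               [-0.3, 0.1, 0.2, 0.1], [0.0, -0.2, 0.1, 0.2]])

# 3-5. Single-head self-attention
Q, K, V = X @ Wq, X @ Wk, X @ Wv
attn_out = softmax(Q @ K.T) @ V

# 6. Feedforward network at the last token ("love")
hidden = np.maximum(attn_out[-1] @ W1, 0.0)
output = hidden @ W2

# 7. Score the toy vocabulary and pick the most likely next word
vocab = {"you":   [0.3, 0.2, -0.1, 0.0],
         "pizza": [-0.2, 0.1, 0.4, 0.1],
         "me":    [0.0, -0.3, 0.2, 0.3]}
logits = np.array([output @ np.array(v) for v in vocab.values()])
probs = softmax(logits)
print(dict(zip(vocab, probs.round(4))))
print("Prediction:", list(vocab)[int(np.argmax(probs))])  # "me"
```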