
Understanding How GPT Predicts the Next Word: A Step-by-Step Walkthrough

In this post, we dive into the mechanics of how a GPT-like transformer model predicts the next word in a sequence. We'll walk through a simplified version with small vectors and matrices to follow the whole flow: embedding the input tokens, passing them through an attention layer and a feedforward network, and finally producing a probability distribution over the vocabulary.

Now, instead of just one word, we’ll use the input: “I love” to explore multi-token attention.


1. Input Embeddings

Let’s say our input sequence is “I love”.

We represent the tokens using 4-dimensional embeddings:

"I"    → x1 = [0.1, 0.3, -0.5, 0.7]
"love" → x2 = [0.2, -0.4, 0.6, 0.8]

We will pass both tokens through the transformer layers.


2. Transformer Layer Weights

We will use small 4×4 matrices for clarity.

Weight matrices:

Wq (Query):
[[ 0.1,  0.2,  0.1,  0.0],
 [ 0.0,  0.1,  0.3,  0.1],
 [ 0.2, -0.1,  0.1,  0.1],
 [ 0.0,  0.2,  0.1, -0.2]]

Wk (Key):
[[ 0.1,  0.0, -0.2,  0.3],
 [-0.1,  0.2,  0.1,  0.0],
 [ 0.0,  0.3, -0.1,  0.2],
 [ 0.1,  0.1,  0.0,  0.1]]

Wv (Value):
[[ 0.3,  0.0,  0.1, -0.1],
 [ 0.0, -0.2,  0.2,  0.1],
 [ 0.1,  0.2,  0.0,  0.2],
 [-0.2,  0.1,  0.1,  0.0]]
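
If you want to check the arithmetic yourself, here is a small NumPy sketch (Python with NumPy assumed available; the variable names are simply ours) that sets up the two embeddings from step 1 and the three weight matrices above:

import numpy as np

# Token embeddings, one row per token
X = np.array([
    [0.1,  0.3, -0.5, 0.7],   # x1: "I"
    [0.2, -0.4,  0.6, 0.8],   # x2: "love"
])

# 4x4 projection matrices for queries, keys and values
Wq = np.array([[0.1,  0.2, 0.1,  0.0],
               [0.0,  0.1, 0.3,  0.1],
               [0.2, -0.1, 0.1,  0.1],
               [0.0,  0.2, 0.1, -0.2]])

Wk = np.array([[ 0.1, 0.0, -0.2, 0.3],
               [-0.1, 0.2,  0.1, 0.0],
               [ 0.0, 0.3, -0.1, 0.2],
               [ 0.1, 0.1,  0.0, 0.1]])

Wv = np.array([[ 0.3,  0.0, 0.1, -0.1],
               [ 0.0, -0.2, 0.2,  0.1],
               [ 0.1,  0.2, 0.0,  0.2],
               [-0.2,  0.1, 0.1,  0.0]])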

3. Q, K, V Computations for Both Tokens

Let’s walk through the actual matrix multiplication for x1 · Wq step-by-step.

For token “I” (x1 = [0.1, 0.3, -0.5, 0.7]):

Wq:

[[ 0.1,  0.2,  0.1,  0.0],
 [ 0.0,  0.1,  0.3,  0.1],
 [ 0.2, -0.1,  0.1,  0.1],
 [ 0.0,  0.2,  0.1, -0.2]]

Q1 = x1 · Wq:

Q1[0] = 0.1×0.1 + 0.3×0.0 + (-0.5)×0.2 + 0.7×0.0 = 0.01 + 0 - 0.10 + 0     = -0.09
Q1[1] = 0.1×0.2 + 0.3×0.1 + (-0.5)×(-0.1) + 0.7×0.2 = 0.02 + 0.03 + 0.05 + 0.14 = 0.24
Q1[2] = 0.1×0.1 + 0.3×0.3 + (-0.5)×0.1 + 0.7×0.1 = 0.01 + 0.09 - 0.05 + 0.07 = 0.12
Q1[3] = 0.1×0.0 + 0.3×0.1 + (-0.5)×0.1 + 0.7×(-0.2) = 0 + 0.03 - 0.05 - 0.14 = -0.16

So:

Q1 = [-0.09, 0.24, 0.12, -0.16]

(Similar steps can be done for K1, V1, and for token “love” as well.)

We compute Q, K, V for each token:

For token “I”:

Q1 = x1 · Wq = [-0.09, 0.24, 0.12, -0.16]
K1 = x1 · Wk = [ 0.05, -0.02, 0.06, 0.00]
V1 = x1 · Wv = [-0.16, -0.09, 0.14, -0.08]

For token “love”:

Q2 = x2 · Wq = [0.14, 0.10, 0.04, -0.14]
K2 = x2 · Wk = [0.14, 0.18, -0.14, 0.26]
V2 = x2 · Wv = [-0.04, 0.28, 0.02, 0.06]
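
Continuing the NumPy sketch from step 2, all six vectors fall out of one matrix product per projection (each row corresponds to one token, "I" then "love"):

Q = X @ Wq   # ≈ [[-0.09,  0.24, 0.12, -0.16], [0.14, 0.10,  0.04, -0.14]]
K = X @ Wk   # ≈ [[ 0.05, -0.02, 0.06,  0.00], [0.14, 0.18, -0.14,  0.26]]
V = X @ Wv   # ≈ [[-0.16, -0.09, 0.14, -0.08], [-0.04, 0.28,  0.02,  0.06]]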

4. Attention Weights

We compute the dot product of each token's Q with every token's K. (Two simplifications to keep the numbers easy to follow: a real transformer divides each score by √d_k, which would be √4 = 2 here, and GPT applies a causal mask so that a token can only attend to itself and earlier tokens. We skip both in this toy example.)

Attention scores for token 1 (“I”):

Let’s compute score_11 = Q1 · K1 using:

Q1 = [-0.09, 0.24, 0.12, -0.16]
K1 = [ 0.05, -0.02, 0.06, 0.00]

Dot product:

score_11 = (-0.09×0.05) + (0.24×(-0.02)) + (0.12×0.06) + (-0.16×0.00)
         = -0.0045 - 0.0048 + 0.0072 + 0 = -0.0021

Doing the same for the other key gives both scores for token 1:

score_11 = Q1 · K1 = -0.0021
score_12 = Q1 · K2 = -0.0278

Attention scores for token 2 (“love”):

score_21 = Q2 · K1 = 0.0074
score_22 = Q2 · K2 = -0.0044

Softmax:

softmax([score_11, score_12]) = softmax([-0.0021, -0.0278]) = [0.5064, 0.4936]
softmax([score_21, score_22]) = softmax([ 0.0074, -0.0044]) = [0.5029, 0.4971]
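
In the NumPy sketch (with the same simplifications as above: no 1/√d_k scaling and no causal mask), the scores for all token pairs and the row-wise softmax are:

scores = Q @ K.T                                  # 2×2 matrix of dot products Q_i · K_j
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)     # softmax over each row
# weights ≈ [[0.5064, 0.4936],
#            [0.5029, 0.4971]]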

5. Attention Outputs (Weighted V Sum)

Let’s understand how attention weights are used to combine V1 and V2.

Each V vector is 4-dimensional, representing the transformed content at each token.

V1 = [-0.16, -0.09, 0.14, -0.08]

V2 = [-0.04, 0.28, 0.02, 0.06]

Attention weights for token 1: [0.5064, 0.4936]

Compute:

attn1[0] = 0.5064×(-0.16) + 0.4936×(-0.04) = -0.0810 + (-0.0197) ≈ -0.1008
attn1[1] = 0.5064×(-0.09) + 0.4936×0.28 = -0.0456 + 0.1382 ≈ 0.0926
attn1[2] = 0.5064×0.14 + 0.4936×0.02 = 0.0709 + 0.0099 ≈ 0.0808
attn1[3] = 0.5064×(-0.08) + 0.4936×0.06 = -0.0405 + 0.0296 ≈ -0.0109

So:

attn1 ≈ [-0.101, 0.093, 0.081, -0.011]

And similarly for attn2 using weights [0.5029, 0.4971].

For token “I”:

attn1 = 0.5064×V1 + 0.4936×V2 ≈ [-0.101, 0.093, 0.081, -0.011]

For token “love”:

attn2 = 0.5029×V1 + 0.4971×V2 ≈ [-0.100, 0.094, 0.080, -0.010]

Notice that both are nearly identical due to similar attention weights.
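
In the sketch, both weighted sums are again a single matrix product:

attn = weights @ V
# attn ≈ [[-0.101, 0.093, 0.081, -0.011],   # output at "I"
#         [-0.100, 0.094, 0.080, -0.010]]   # output at "love"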


6. Feedforward Network (FFN)

Same as before:

W1 = [[ 0.2, -0.1,  0.0,  0.3],
      [ 0.0,  0.1,  0.2,  0.1],
      [ 0.1,  0.0, -0.1,  0.2],
      [-0.2,  0.3,  0.2, -0.1]]

W2 = [[ 0.1,  0.3, -0.2,  0.0],
      [ 0.2, -0.1,  0.1,  0.3],
      [-0.3,  0.1,  0.2,  0.1],
      [ 0.0, -0.2,  0.1,  0.2]]

Both positions pass through the FFN, but for predicting the next word we only need the output at the last position (token "love"), so that is the one we show. (A real transformer layer also wraps attention and the FFN with residual connections and layer normalization; we skip those here to keep the numbers small.)

Using attn2 ≈ [-0.100, 0.094, 0.080, -0.010]:

Hidden = ReLU(attn2 · W1) ≈ ReLU([-0.0100, 0.0164, 0.0088, -0.0036]) = [0, 0.0164, 0.0088, 0]
Output = Hidden · W2 ≈ [0.0006, -0.0008, 0.0034, 0.0058]
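
The same step in the NumPy sketch, with W1 and W2 entered the same way as the attention weights (running the code gives slightly different last digits than the hand-rounded numbers above; that is just rounding):

W1 = np.array([[ 0.2, -0.1,  0.0,  0.3],
               [ 0.0,  0.1,  0.2,  0.1],
               [ 0.1,  0.0, -0.1,  0.2],
               [-0.2,  0.3,  0.2, -0.1]])

W2 = np.array([[ 0.1,  0.3, -0.2,  0.0],
               [ 0.2, -0.1,  0.1,  0.3],
               [-0.3,  0.1,  0.2,  0.1],
               [ 0.0, -0.2,  0.1,  0.2]])

hidden = np.maximum(attn @ W1, 0.0)   # ReLU
output = hidden @ W2
# output[1] (the "love" position) ≈ [0.0007, -0.0008, 0.0034, 0.0058]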

7. Final Prediction for Next Word

We now take the last output (from token “love”) and compare it with vocabulary embeddings:

Vocab embeddings:

"you"   → [ 0.3,  0.2, -0.1,  0.0]
"pizza" → [-0.2,  0.1,  0.4,  0.1]
"me"    → [ 0.0, -0.3,  0.2,  0.3]

Dot product:

logit("you")   = 0.0065×0.3 + 0.0212×0.2 + 0.0076×(-0.1) + 0.0194×0 = 0.0056
logit("pizza") = -0.0013 + 0.0021 + 0.0030 + 0.0019 = 0.0057
logit("me")    = 0 + (-0.0063) + 0.0015 + 0.0058 = 0.0010

Softmax:

probs ≈ [33.3%, 33.3%, 33.4%] (the logits are so small that the distribution is nearly uniform)

✅ Prediction:

Next word = "me" (highest logit, although only by a tiny margin; with these small made-up weights the model has no strong preference yet)
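
Closing the loop in the NumPy sketch (using the three toy vocabulary embeddings above):

vocab = {
    "you":   np.array([ 0.3,  0.2, -0.1, 0.0]),
    "pizza": np.array([-0.2,  0.1,  0.4, 0.1]),
    "me":    np.array([ 0.0, -0.3,  0.2, 0.3]),
}

last = output[1]                                   # FFN output at the last token, "love"
logits = {w: float(last @ e) for w, e in vocab.items()}
exps = {w: np.exp(v) for w, v in logits.items()}
total = sum(exps.values())
probs = {w: v / total for w, v in exps.items()}    # softmax over the 3-word vocabulary
prediction = max(probs, key=probs.get)             # → "me", by a tiny margin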

🔁 Recap

In simple terms:

  1. Each input word is converted into a vector (like a number representation of meaning).
  2. The model learns to focus on relevant words using attention — it compares how related each word is to others.
  3. It mixes the information together and passes it through layers of math (like matrix multiplication and activation functions).
  4. Finally, it compares the result with all possible words it knows and picks the one that best fits as the next word.

This is like asking: “Given ‘I love’, what’s the most likely next word?” — and the model answers by calculating which word fits best based on what it has learned from billions of sentences.

At each transformer layer (a code sketch of these steps follows the list):

  1. Compute Q, K, V for all tokens
  2. Compute attention scores between tokens (Q·Kᵀ)
  3. Softmax to get attention weights
  4. Weighted sum of V vectors (Attention output)
  5. Pass through FFN (W1 → ReLU → W2)
  6. Use final output to compute dot product with vocab embeddings
  7. Apply softmax → pick highest probability token
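
Putting the whole recipe into one function gives a compact sketch of the layer. This is a hypothetical helper (our own name, next_token) under the same simplifications as the walkthrough: one layer, one attention head, no score scaling, no causal mask, no residual connections or layer norm.

def next_token(X, Wq, Wk, Wv, W1, W2, vocab):
    """Toy single-layer, single-head next-word prediction; returns (word, probs)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # 1. project tokens to Q, K, V
    scores = Q @ K.T                               # 2. attention scores
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)              # 3. softmax → attention weights
    attn = w @ V                                   # 4. weighted sum of value vectors
    out = np.maximum(attn @ W1, 0.0) @ W2          # 5. feedforward: W1 → ReLU → W2
    last = out[-1]                                 # 6. use the last position
    logits = np.array([last @ e for e in vocab.values()])
    probs = np.exp(logits) / np.exp(logits).sum()  # 7. softmax over the vocabulary
    words = list(vocab.keys())
    return words[int(np.argmax(probs))], dict(zip(words, probs))

word, probs = next_token(X, Wq, Wk, Wv, W1, W2, vocab)   # → "me" for this toy setup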

This example used just 2 tokens with 4-dimensional vectors; GPT-3, for comparison, uses 96 transformer layers and 12,288-dimensional embeddings!
