
How Large Language Models Reach Billions of Parameters — Explained with Tiny Numbers

When you hear a model like GPT-3 has 175 billion parameters, it sounds huge — and it is. But what does that actually mean? Let’s break it down with a super simple example, then scale up to real numbers.


🔹 Step 1: A Tiny Transformer Model

Let’s design a tiny Transformer for learning purposes:

  • Vocabulary size (V): 100
  • Sequence length (L): 10
  • Embedding size (E): 5
  • Layers: 2
  • Attention heads: 1
  • Feedforward size (FF): 10

🔸 What Do These Mean?

  • Vocabulary size (V): Number of unique tokens the model can understand. Each token has its own learned vector (embedding).
  • Sequence length (L): How many tokens the model can look at in one pass (like a sliding window of text).
  • Embedding size (E): Size of the vector that represents each token.
  • Feedforward size: Hidden layer width in the MLP part of the transformer block.

🔸 Parameter Count Breakdown

1. Embeddings

  • Word embeddings: V × E = 100 × 5 = 500
  • Positional embeddings: L × E = 10 × 5 = 50

Total: 550
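
To make this concrete, here is a minimal PyTorch sketch (assuming PyTorch is available; the variable names are just illustrative) that builds both embedding tables and confirms the counts above:

    import torch
    import torch.nn as nn

    V, L, E = 100, 10, 5  # vocabulary size, sequence length, embedding size

    tok_emb = nn.Embedding(V, E)  # one learned E-dim vector per token: V x E weights
    pos_emb = nn.Embedding(L, E)  # one learned E-dim vector per position: L x E weights

    ids = torch.randint(0, V, (L,))              # a toy sequence of 10 token IDs
    x = tok_emb(ids) + pos_emb(torch.arange(L))  # each token becomes a 5-dim vector

    print(tok_emb.weight.numel())  # 500
    print(pos_emb.weight.numel())  # 50
    print(x.shape)                 # torch.Size([10, 5])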

2. Self-Attention (Per Layer)

Each layer has 4 weight matrices:

  • W_Q, W_K, W_V, W_O: each of size E × E = 5 × 5 = 25

Total per layer = 4 × 25 = 100
Total for 2 layers = 2 × 100 = 200
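
Note that the number of attention heads does not change this count: with h heads, each head's projection matrix has shape E × (E/h), and the h heads together still add up to E × E per projection. A quick plain-Python check, using a hypothetical E = 8 so it divides evenly across several head counts:

    # Attention parameter count is independent of the head count:
    # h heads, each projecting E -> E/h, still total E x E per projection.
    E = 8  # hypothetical embedding size, divisible by 1, 2, 4, and 8

    for h in [1, 2, 4, 8]:
        per_projection = h * (E * (E // h))  # all heads together for one of W_Q/W_K/W_V/W_O
        print(h, 4 * per_projection)         # always 4 * E * E = 256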

3. Feedforward Network (Per Layer)

  • First linear: E × FF = 5 × 10 = 50
  • Second linear: FF × E = 10 × 5 = 50

Total per layer = 100
Total for 2 layers = 200

4. LayerNorm and Biases

Each LayerNorm learns a scale and a bias vector of size E, i.e. 2 × E = 10 parameters. Counting roughly 10 per layer (and ignoring the small bias vectors in the linear layers): 10 × 2 layers = 20


✅ Grand Total (Tiny Model)

  • Embeddings: 550
  • Attention: 200
  • Feedforward: 200
  • LayerNorm/etc.: 20
  • Total: 970
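
Here is a small plain-Python sketch (the function name count_params is mine, not from any library) that reproduces this breakdown from the config:

    def count_params(V, L, E, n_layers, FF):
        """Rough Transformer parameter count, matching the simplified breakdown above."""
        embeddings = V * E + L * E           # word + positional embeddings
        attention = n_layers * 4 * E * E     # W_Q, W_K, W_V, W_O per layer
        feedforward = n_layers * 2 * E * FF  # two linear layers per block
        layernorm = n_layers * 2 * E         # the rough ~10 per layer used above
        return embeddings + attention + feedforward + layernorm

    print(count_params(V=100, L=10, E=5, n_layers=2, FF=10))  # 970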

🔹 Step 2: How GPT-3 Has 175 Billion Parameters

Let’s now scale up to GPT-3-style values:

  • Layers: 96
  • Embedding size (E): 12,288
  • FF size: 49,152
  • Attention heads: 96
  • Vocabulary size (V): ~50,000

Estimated Parameters:

  • Embeddings: ~614M (50,000 × 12,288 word embeddings, plus positional embeddings)
  • Attention: ~58B (96 layers × 4 × 12,288²)
  • Feedforward: ~116B (96 layers × 2 × 12,288 × 49,152)
  • Other (LayerNorm, biases): on the order of tens of millions

Total ≈ 0.6B + 58B + 116B ≈ 175 Billion
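
Running the same arithmetic at GPT-3 scale shows where the 175B comes from (a rough estimate; the real architecture has details this ignores, and L = 2,048 is GPT-3's context length):

    # Same formulas as the tiny model, at GPT-3 scale.
    V, L, E, n_layers, FF = 50_000, 2_048, 12_288, 96, 49_152

    embeddings = V * E + L * E           # ~640M (word + positional)
    attention = n_layers * 4 * E * E     # ~58B
    feedforward = n_layers * 2 * E * FF  # ~116B
    total = embeddings + attention + feedforward

    print(f"{total / 1e9:.1f}B")  # 174.6B -- roughly 175 billion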
