How Large Language Models Reach Billions of Parameters — Explained with Tiny Numbers
When you hear a model like GPT-3 has 175 billion parameters, it sounds huge — and it is. But what does that actually mean? Let’s break it down with a super simple example, then scale up to real numbers.
🔹 Step 1: A Tiny Transformer Model
Let’s design a tiny Transformer for learning purposes:
| Parameter | Value |
|---|---|
| Vocabulary size (V) | 100 |
| Sequence length (L) | 10 |
| Embedding size (E) | 5 |
| Layers | 2 |
| Attention heads | 1 |
| Feedforward size | 10 |
🔸 What Do These Mean?
- Vocabulary size (V): Number of unique tokens the model can understand. Each token has its own learned vector (embedding).
- Sequence length (L): How many tokens the model can look at in one pass (like a sliding window of text).
- Embedding size (E): Size of the vector that represents each token.
- Feedforward size: Hidden layer width in the MLP part of the transformer block.
🔸 Parameter Count Breakdown
1. Embeddings
- Word embeddings: V × E = 100 × 5 = 500
- Positional embeddings: L × E = 10 × 5 = 50
✅ Total: 550
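The embedding arithmetic above can be checked with a few lines of Python (values taken straight from the tiny-model table):

```python
# Embedding parameter count for the tiny model.
V, L, E = 100, 10, 5              # vocabulary size, sequence length, embedding size

word_embeddings = V * E           # one E-dim vector per token  -> 500
positional_embeddings = L * E     # one E-dim vector per position -> 50

total_embeddings = word_embeddings + positional_embeddings
print(total_embeddings)           # 550
```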
2. Self-Attention (Per Layer)
Each layer has 4 weight matrices:
- WQ, WK, WV, WO: each of size E × E = 5 × 5 = 25
Total per layer = 4 × 25 = 100
Total for 2 layers = 2 × 100 = 200
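The same count in code (this sketch counts only the four E × E projection weights, matching the simplification above):

```python
# Self-attention weights: W_Q, W_K, W_V, W_O, each E x E, in every layer.
E, layers = 5, 2

per_matrix = E * E                # 25
per_layer = 4 * per_matrix        # 100
attention_total = layers * per_layer
print(attention_total)            # 200
```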
3. Feedforward Network (Per Layer)
- First linear: E × FF = 5 × 10 = 50
- Second linear: FF × E = 10 × 5 = 50
Total per layer = 100
Total for 2 layers = 200
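And the feedforward tally, again mirroring the two linear layers described above:

```python
# Feedforward (MLP) weights: an up-projection E x FF and a down-projection FF x E.
E, FF, layers = 5, 10, 2

per_layer = E * FF + FF * E       # 50 + 50 = 100
ff_total = layers * per_layer
print(ff_total)                   # 200
```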
4. LayerNorm and Biases
Each LayerNorm has 2 × E = 10 parameters (a scale and a shift per embedding dimension). Counting roughly 10 per layer × 2 layers = 20 (linear-layer biases are ignored in this rough tally).
✅ Grand Total (Tiny Model)
| Component | Parameters |
|---|---|
| Embeddings | 550 |
| Attention | 200 |
| Feedforward | 200 |
| LayerNorm/etc. | 20 |
| Total | 970 |
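The whole table can be reproduced with one helper function (a rough sketch: the `norm_per_layer=10` argument mirrors the LayerNorm approximation used above, and biases are ignored):

```python
def count_params(V, L, E, layers, ff, norm_per_layer=10):
    """Rough parameter count for a tiny decoder-style Transformer."""
    embeddings = V * E + L * E            # word + positional embeddings
    attention = layers * 4 * E * E        # W_Q, W_K, W_V, W_O per layer
    feedforward = layers * 2 * E * ff     # two linear layers per layer
    other = layers * norm_per_layer       # LayerNorm scale/shift estimate
    return embeddings + attention + feedforward + other

print(count_params(V=100, L=10, E=5, layers=2, ff=10))  # 970
```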
🔹 Step 2: How GPT-3 Has 175 Billion Parameters
Let’s now scale up to GPT-3-style values:
| Parameter | Value |
|---|---|
| Layers | 96 |
| Embedding size (E) | 12,288 |
| FF size | 49,152 |
| Attention heads | 96 |
| Vocabulary size (V) | ~50,000 |
Estimated Parameters:
- Embeddings: V × E ≈ 50,000 × 12,288 ≈ 614M
- Attention: 96 layers × 4 × E² ≈ 58B
- Feedforward: 96 layers × 2 × E × FF ≈ 116B
- Other (LayerNorm, biases): a few million
✅ Total ≈ 175 Billion
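Plugging the GPT-3-style values into the same formulas as the tiny model reproduces the headline figure (the sequence length of 2,048 is an assumption here, since it does not appear in the table above):

```python
# GPT-3-scale estimate using the same counting formulas as the tiny model.
V, L, E = 50_000, 2_048, 12_288    # vocab, context length (assumed), embed size
layers, ff = 96, 49_152

embeddings  = V * E + L * E        # ~0.64B
attention   = layers * 4 * E * E   # ~58B
feedforward = layers * 2 * E * ff  # ~116B

total = embeddings + attention + feedforward
print(f"{total / 1e9:.0f}B")       # ≈ 175B
```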