
How Large Language Models Reach Billions of Parameters — Explained with Tiny Numbers

When you hear a model like GPT-3 has 175 billion parameters, it sounds huge — and it is. But what does that actually mean? Let’s break it down with a super simple example, then scale up to real numbers.


🔹 Step 1: A Tiny Transformer Model

Let’s design a tiny Transformer for learning purposes:

  • Vocabulary size (V): 100
  • Sequence length (L): 10
  • Embedding size (E): 5
  • Layers: 2
  • Attention heads: 1
  • Feedforward size (FF): 10

🔸 What Do These Mean?

  • Vocabulary size (V): Number of unique tokens the model can understand. Each token has its own learned vector (embedding).
  • Sequence length (L): How many tokens the model can look at in one pass (like a sliding window of text).
  • Embedding size (E): Size of the vector that represents each token.
  • Feedforward size: Hidden layer width in the MLP part of the transformer block.

🔸 Parameter Count Breakdown

1. Embeddings

  • Word embeddings: V × E = 100 × 5 = 500
  • Positional embeddings: L × E = 10 × 5 = 50

Total: 550
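
To make this concrete, here is a minimal PyTorch sketch (assuming PyTorch is available; the variable names are just illustrative) that builds both embedding tables and confirms the counts above:

    import torch
    import torch.nn as nn

    V, L, E = 100, 10, 5  # vocabulary size, sequence length, embedding size

    tok_emb = nn.Embedding(V, E)  # one learned E-dim vector per token: V x E weights
    pos_emb = nn.Embedding(L, E)  # one learned E-dim vector per position: L x E weights

    ids = torch.randint(0, V, (L,))              # a toy sequence of 10 token IDs
    x = tok_emb(ids) + pos_emb(torch.arange(L))  # each token becomes a 5-dim vector

    print(tok_emb.weight.numel())  # 500
    print(pos_emb.weight.numel())  # 50
    print(x.shape)                 # torch.Size([10, 5])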

2. Self-Attention (Per Layer)

Each layer has 4 weight matrices:

  • W_Q, W_K, W_V, W_O: each of size E × E = 5 × 5 = 25

Total per layer = 4 × 25 = 100
Total for 2 layers = 2 × 100 = 200
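
Note that the number of attention heads does not change this count: with h heads, each head's projection matrix has shape E × (E/h), and the h heads together still add up to E × E per projection. A quick plain-Python check, using a hypothetical E = 8 so it divides evenly across several head counts:

    # Attention parameter count is independent of the head count:
    # h heads, each projecting E -> E/h, still total E x E per projection.
    E = 8  # hypothetical embedding size, divisible by 1, 2, 4, and 8

    for h in [1, 2, 4, 8]:
        per_projection = h * (E * (E // h))  # all heads together for one of W_Q/W_K/W_V/W_O
        print(h, 4 * per_projection)         # always 4 * E * E = 256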

3. Feedforward Network (Per Layer)

  • First linear: E × FF = 5 × 10 = 50
  • Second linear: FF × E = 10 × 5 = 50

Total per layer = 100
Total for 2 layers = 200

4. LayerNorm and Biases

Each LayerNorm learns a scale and a bias vector of size E, i.e. 2 × E = 10 parameters. Counting roughly 10 per layer (and ignoring the small bias vectors in the linear layers): 10 × 2 layers = 20


✅ Grand Total (Tiny Model)

  • Embeddings: 550
  • Attention: 200
  • Feedforward: 200
  • LayerNorm/etc.: 20
  • Total: 970
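
Here is a small plain-Python sketch (the function name count_params is mine, not from any library) that reproduces this breakdown from the config:

    def count_params(V, L, E, n_layers, FF):
        """Rough Transformer parameter count, matching the simplified breakdown above."""
        embeddings = V * E + L * E           # word + positional embeddings
        attention = n_layers * 4 * E * E     # W_Q, W_K, W_V, W_O per layer
        feedforward = n_layers * 2 * E * FF  # two linear layers per block
        layernorm = n_layers * 2 * E         # the rough ~10 per layer used above
        return embeddings + attention + feedforward + layernorm

    print(count_params(V=100, L=10, E=5, n_layers=2, FF=10))  # 970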

🔹 Step 2: How GPT-3 Has 175 Billion Parameters

Let’s now scale up to GPT-3-style values:

  • Layers: 96
  • Embedding size (E): 12,288
  • FF size: 49,152
  • Attention heads: 96
  • Vocabulary size (V): ~50,000

Estimated Parameters:

  • Embeddings: ~614M (50,000 × 12,288 word embeddings, plus positional embeddings)
  • Attention: ~58B (96 layers × 4 × 12,288²)
  • Feedforward: ~116B (96 layers × 2 × 12,288 × 49,152)
  • Other (LayerNorm, biases): on the order of tens of millions

Total ≈ 0.6B + 58B + 116B ≈ 175 Billion
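
Running the same arithmetic at GPT-3 scale shows where the 175B comes from (a rough estimate; the real architecture has details this ignores, and L = 2,048 is GPT-3's context length):

    # Same formulas as the tiny model, at GPT-3 scale.
    V, L, E, n_layers, FF = 50_000, 2_048, 12_288, 96, 49_152

    embeddings = V * E + L * E           # ~640M (word + positional)
    attention = n_layers * 4 * E * E     # ~58B
    feedforward = n_layers * 2 * E * FF  # ~116B
    total = embeddings + attention + feedforward

    print(f"{total / 1e9:.1f}B")  # 174.6B -- roughly 175 billion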
