Claude’s Prompt Cache: You’re Still Sending All the Tokens
If you’ve looked at your Claude API usage and wondered why input tokens seem
suspiciously low — you’re not imagining things. The label is misleading.
Here’s what’s actually happening.
—
The Three Token Buckets
Every API call to Claude splits your tokens into three categories:
┌─────────────┬────────────────────────────────────────┬────────────────┐
│ Category │ Meaning │ Price (approx) │
├─────────────┼────────────────────────────────────────┼────────────────┤
│ input │ Tokens NOT in cache │ ~$3/M │
├─────────────┼────────────────────────────────────────┼────────────────┤
│ cache_write │ Tokens being cached for the first time │ ~$3.75/M │
├─────────────┼────────────────────────────────────────┼────────────────┤
│ cache_read │ Tokens served from cache │ ~$0.30/M │
└─────────────┴────────────────────────────────────────┴────────────────┘
Your total data sent = all three combined. The input label alone doesn't tell
you how many tokens you actually sent. (The rates above are Sonnet's; other
models scale proportionally: cache writes run about 1.25x the base input rate,
cache reads about 0.1x.)
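You can see all three buckets on any response by inspecting its usage object.
A minimal sketch with the anthropic Python SDK (the model name and prompt are
placeholders):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-haiku-4-5",   # placeholder; any Claude model works
        max_tokens=256,
        messages=[{"role": "user", "content": "Summarize the design doc."}],
    )

    u = response.usage
    total_sent = (
        u.input_tokens                          # uncached tokens
        + (u.cache_creation_input_tokens or 0)  # cached for the first time
        + (u.cache_read_input_tokens or 0)      # served from cache
    )
    print(f"input={u.input_tokens} write={u.cache_creation_input_tokens} "
          f"read={u.cache_read_input_tokens} total={total_sent}")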
—
How the Cache Works
Every turn, your client sends the full context to Anthropic’s servers —
conversation history, system prompt, file contents, everything. Anthropic’s
server then checks:
▎ “Have I seen this exact prefix before, recently (~5 min TTL)?”
– Cache hit → tokens billed as cache_read (10x cheaper)
– Cache miss → tokens billed as input or cache_write
The saving is not in bandwidth: all the data travels over the wire every time.
The saving is in Anthropic's GPU compute. On a hit, the already-processed
state for the prefix is reused instead of recomputed, and the discount passes
that saving on.
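In the raw API, caching also isn't automatic: you opt in by placing a
cache_control breakpoint on a content block, and everything up to that block
becomes the cacheable prefix (clients like Claude Code set this up for you).
A sketch, with the system prompt contents as a placeholder:

    import anthropic

    client = anthropic.Anthropic()

    BIG_STABLE_CONTEXT = "..."  # placeholder: system prompt, file contents, etc.

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        system=[
            {
                "type": "text",
                "text": BIG_STABLE_CONTEXT,
                # Everything up to and including this block is the cacheable
                # prefix (~5 min TTL, refreshed on every hit).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "First question."}],
    )

The first call bills that prefix as cache_write; a call with an identical
prefix within the TTL bills it as cache_read.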
—
A Real Example
Here’s actual usage from a Claude Code session that did a deep codebase search
across Java files:
claude-haiku-4-5:
input: 583 tokens
cache_write: 90.7k tokens
cache_read: 1.0M tokens
cost: $0.24
Real tokens sent: ~1.09M. Not 583.
If those 1.09M tokens had all been regular input at $0.80/M (Haiku input
rate), the cost would’ve been ~$0.87. With caching: $0.24. ~3.5x cheaper.
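The arithmetic, using this post's numbers (Haiku input at $0.80/M, plus the
standard cache multipliers of 1.25x for writes and 0.1x for reads):

    # Rates quoted in this post; the multipliers are Anthropic's standard
    # 5-minute cache pricing (writes 1.25x input, reads 0.1x input).
    INPUT_RATE = 0.80 / 1e6          # $ per token
    WRITE_RATE = INPUT_RATE * 1.25   # $1.00/M
    READ_RATE  = INPUT_RATE * 0.10   # $0.08/M

    input_t, write_t, read_t = 583, 90_700, 1_000_000
    total = input_t + write_t + read_t            # 1,091,283 (~1.09M)

    uncached = total * INPUT_RATE
    cached = input_t * INPUT_RATE + write_t * WRITE_RATE + read_t * READ_RATE
    print(f"uncached=${uncached:.2f}  cached=${cached:.2f}")
    # uncached=$0.87  cached=$0.17

The input-side cost comes to ~$0.17; the gap to the session's $0.24 is
presumably output tokens, which are billed separately.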
—
What Breaks the Cache
The cache is keyed on exact prefix match. Anything that changes the prefix
causes a miss from the point of the change onward:
– You edit a file that's in the context → that file, and everything after it,
misses
– New conversation, or the ~5 min TTL expires → miss; everything becomes
cache_write again
– Different system prompt → miss from the very top, since the system prompt
leads the prefix
Cache misses aren't catastrophic: they just cost input or cache_write rates
for that turn, and re-prime the cache for subsequent turns.
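To make "exact prefix match" concrete, here's a toy model (an illustration
only, not Anthropic's implementation): the cache remembers prefixes it has
seen, and a request reads from cache only up to the longest prefix that still
matches.

    # Toy prefix cache: context is a list of chunks; billing follows the
    # longest previously-seen prefix. Not Anthropic's actual mechanism.
    def split_billing(context, cached_prefixes):
        """Return (chunks billed as cache_read, chunks billed as new)."""
        for cut in range(len(context), 0, -1):
            if tuple(context[:cut]) in cached_prefixes:
                return context[:cut], context[cut:]
        return [], context

    cached = {("system",), ("system", "file_v1"),
              ("system", "file_v1", "turn1")}

    # Unchanged context: everything hits.
    print(split_billing(["system", "file_v1", "turn1"], cached))
    # (['system', 'file_v1', 'turn1'], [])

    # Edit the file mid-context: the prefix diverges at the file, so the
    # edited file AND everything after it miss.
    print(split_billing(["system", "file_v2", "turn1"], cached))
    # (['system'], ['file_v2', 'turn1'])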
—
Why First Reads Show Up as cache_write
A reasonable objection at this point: the first time a file is read, its
content is new, so it should be billed as input or cache_write, never
cache_read (there's nothing in the cache to read from yet). That's exactly
right, and the rule is:
– First time seen + marked for caching → cache_write
– First time seen + NOT marked → input
– Seen before + still in cache → cache_read
cache_write IS the first read. When new content gets cached, Anthropic bills
it as cache_write, not input. So in the session above, the 90.7k cache_write
is the file contents being tokenized and cached for the first time, and the
583 input is the tiny uncached remainder (the agent prompt itself, a few
instructions).
The full picture:
    583 input → small new content, not cached
    90.7k cache_write → files read for the first time, now cached
    1.0M cache_read → system prompt + conversation history resent every turn
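The same three-way rule as a function (a conceptual sketch of the billing
logic, not Anthropic's code):

    def bucket(seen_in_cache: bool, marked_for_caching: bool) -> str:
        """Which bucket a span of prompt tokens lands in."""
        if seen_in_cache:
            return "cache_read"    # prefix hit: ~0.1x the input rate
        if marked_for_caching:
            return "cache_write"   # first time, cached: ~1.25x
        return "input"             # first time, not cached: 1x

    print(bucket(seen_in_cache=False, marked_for_caching=True))   # cache_write
    print(bucket(seen_in_cache=True,  marked_for_caching=True))   # cache_read
    print(bucket(seen_in_cache=False, marked_for_caching=False))  # input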
—
The Practical Takeaway
When reading your usage stats, ignore input in isolation. The number that
tells you actual data volume is:
total_tokens = input + cache_write + cache_read
And if cache_read dominates — that’s a good thing. It means your context is
stable across turns and you’re getting the bulk discount on repeated content.
Cache doesn’t reduce what you send. It reduces what you pay per token for
sending it.
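For a one-number health check, compute the hit ratio from the same usage
fields (a small helper; the function name is mine):

    def cache_hit_ratio(input_t: int, cache_write_t: int, cache_read_t: int) -> float:
        """Fraction of sent tokens that were served from cache."""
        total = input_t + cache_write_t + cache_read_t
        return cache_read_t / total if total else 0.0

    # The session above: 1.0M of ~1.09M tokens came from cache.
    print(f"{cache_hit_ratio(583, 90_700, 1_000_000):.1%}")  # 91.6%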
—