
Claude’s Prompt Cache: You’re Still Sending All the Tokens

  If you’ve looked at your Claude API usage and wondered why input tokens seem suspiciously low, you’re not imagining things. The label is misleading. Here’s what’s actually happening.

  —                                                                           

  The Three Token Buckets                                                       

  Every API call to Claude splits your tokens into three categories:

  ┌─────────────┬────────────────────────────────────────┬────────────────┐
  │  Category   │                Meaning                 │ Price (approx) │
  ├─────────────┼────────────────────────────────────────┼────────────────┤
  │ input       │ Tokens NOT in cache                    │ ~$3/M          │
  ├─────────────┼────────────────────────────────────────┼────────────────┤
  │ cache_write │ Tokens being cached for the first time │ ~$3.75/M       │
  ├─────────────┼────────────────────────────────────────┼────────────────┤
  │ cache_read  │ Tokens served from cache               │ ~$0.30/M       │
  └─────────────┴────────────────────────────────────────┴────────────────┘

  Your total data sent = all three combined. (The rates above are Sonnet-tier list prices; cheaper models like Haiku keep the same multipliers: writes at 1.25x the input rate, reads at 0.1x.) The input label alone doesn’t tell you how many tokens you actually sent.
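
  If you want the real number programmatically, here’s a minimal sketch using the Anthropic Python SDK. The usage field names are the API’s own; the model string and prompt are placeholders:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": "Summarise the build steps."}],
    )

    u = message.usage
    # input_tokens is only the UNCACHED portion; cache fields may be 0/None
    # on responses that touched no cache.
    total_sent = (
        u.input_tokens
        + (u.cache_creation_input_tokens or 0)
        + (u.cache_read_input_tokens or 0)
    )
    print(f"input={u.input_tokens} write={u.cache_creation_input_tokens} "
          f"read={u.cache_read_input_tokens} -> actually sent: {total_sent}")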

  —                                                                         

  How the Cache Works

  Every turn, your client sends the full context to Anthropic’s servers: conversation history, system prompt, file contents, everything. Anthropic’s server then checks:

  ▎ “Have I seen this exact prefix before, recently (~5 min TTL)?”

  – Cache hit → tokens billed as cache_read (10x cheaper)
  – Cache miss → tokens billed as input or cache_write

  The saving is not in bandwidth. All the data travels over the wire every time. The saving is in Anthropic’s GPU compute: they don’t re-process tokens they’ve already processed recently, and they pass that saving on.
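
  Opting a stable prefix into the cache is explicit: you place a cache_control breakpoint at the end of the content you want reused. A minimal sketch with the Python SDK (the system prompt here is a stand-in):

    import anthropic

    client = anthropic.Anthropic()

    LONG_SYSTEM_PROMPT = "You are a code-search agent. ..."  # imagine ~90k tokens

    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Everything up to this breakpoint gets cached (~5 min TTL).
                # First turn: billed as cache_write. Identical resends within
                # the TTL: billed as cache_read. Prefixes below the model's
                # minimum cacheable length are not cached at all.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Where is the retry logic?"}],
    )
    print(message.usage)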

  —

  A Real Example

  Here’s actual usage from a Claude Code session that did a deep codebase search across Java files:

  claude-haiku-4-5:

    input:       583 tokens
    cache_write: 90.7k tokens
    cache_read:  1.0M tokens
    cost:        $0.24

  Real tokens sent: ~1.09M. Not 583.

  If those 1.09M tokens had all been billed as regular input at $0.80/M (the Haiku input rate), the cost would’ve been ~$0.87. With caching: $0.24. ~3.5x cheaper.
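
  To see roughly where those numbers come from, here’s the arithmetic as a sketch. The 1.25x write premium and 0.1x read discount are the standard cache multipliers; output tokens are deliberately left out, which likely explains the gap to the session’s real $0.24:

    def est_cost(new_input, cache_write, cache_read,
                 input_rate=0.80, write_mult=1.25, read_mult=0.10):
        # Rates in $/million input tokens; output tokens are not modelled.
        per = input_rate / 1_000_000
        return (new_input * per
                + cache_write * per * write_mult
                + cache_read * per * read_mult)

    with_cache = est_cost(583, 90_700, 1_000_000)
    no_cache = est_cost(583 + 90_700 + 1_000_000, 0, 0)
    print(f"with cache: ~${with_cache:.2f}   without: ~${no_cache:.2f}")
    # -> with cache: ~$0.17   without: ~$0.87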

  —                                                                         

  What Breaks the Cache

  The cache is keyed on an exact prefix match. Anything that changes the prefix = cache miss:

  – You edit a file that’s in the context → miss
  – New conversation (TTL expired) → miss; everything becomes cache_write again
  – Different system prompt → miss

  Cache misses aren’t catastrophic. They just cost regular input rates for that turn, and they re-prime the cache for subsequent turns.
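
  A conceptual model of the keying (not Anthropic’s implementation, just an illustration of why the edit’s position matters): imagine one cache key per prefix length. An edit near the start invalidates every key after it; an edit further in leaves the earlier ones intact:

    import hashlib

    def prefix_keys(context: str, chunk: int = 256):
        # One key per prefix length; a crude stand-in for token-level keying.
        for end in range(chunk, len(context) + 1, chunk):
            yield hashlib.sha256(context[:end].encode()).hexdigest()[:12]

    base = "SYSTEM PROMPT " + "file contents " * 200    # long, stable context
    early = "SYSTEM prompt " + "file contents " * 200   # edit near the start
    middle = base[:1400] + base[1400:].upper()          # edit halfway in

    def matching(a: str, b: str) -> int:
        return sum(x == y for x, y in zip(prefix_keys(a), prefix_keys(b)))

    n = len(list(prefix_keys(base)))
    print(f"edit at the start: {matching(base, early)}/{n} prefixes survive")   # 0/10
    print(f"edit in the middle: {matching(base, middle)}/{n} prefixes survive") # 5/10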

  —             

  A natural follow-up: the first time a file is read, its content is brand new. Where should those tokens land?

  – Content is NEW → it should be billed as input OR cache_write
  – NOT cache_read (there’s nothing in the cache to read from yet)

  So first reads should show up as either input or cache_write, never cache_read.

  The answer: cache_write IS the first read                                     

  When content is new AND gets cached, Anthropic bills it as cache_write, NOT as input.

  First time seen  +  marked for caching  →  cache_write
  First time seen  +  NOT marked          →  input
  Seen before      +  in cache            →  cache_read

  So the 90.7k cache_write = the file contents being tokenized and cached for the first time. The 583 input = the tiny bits of uncached content (like the agent prompt itself, a few instructions).
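
  The same rule as a tiny function (the function and its names are mine; the mapping is the one above):

    def billing_bucket(seen_recently: bool, marked_for_cache: bool) -> str:
        # Which bucket a span of prompt tokens is billed under.
        if seen_recently:
            return "cache_read"   # prefix hit within the TTL
        return "cache_write" if marked_for_cache else "input"

    assert billing_bucket(False, True) == "cache_write"   # first read of a file
    assert billing_bucket(False, False) == "input"        # small uncached scraps
    assert billing_bucket(True, True) == "cache_read"     # resent stable context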

  So the full picture

  583 input         → small new content, NOT cached
  90.7k cache_write → files read for the first time, now cached
  1.0M cache_read   → system prompt + conversation history, resent every turn

  The Practical Takeaway

  When reading your usage stats, ignore input in isolation. The number that tells you actual data volume is:

  total_tokens = input + cache_write + cache_read
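
  Across a whole session, a small accumulator makes that number (and the cache hit fraction) easy to track. A sketch, assuming the usage attributes of the Messages API response (the class itself is mine):

    from dataclasses import dataclass

    @dataclass
    class SessionTotals:
        new_input: int = 0
        cache_write: int = 0
        cache_read: int = 0

        def add(self, usage) -> None:
            # usage is a Messages API response's .usage object.
            self.new_input += usage.input_tokens
            self.cache_write += usage.cache_creation_input_tokens or 0
            self.cache_read += usage.cache_read_input_tokens or 0

        @property
        def total_sent(self) -> int:
            return self.new_input + self.cache_write + self.cache_read

        @property
        def cached_fraction(self) -> float:
            return self.cache_read / self.total_sent if self.total_sent else 0.0

    # With the session above: 1.0M / (583 + 90.7k + 1.0M) ≈ 92% served from cache.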

  And if cache_read dominates, that’s a good thing. It means your context is stable across turns and you’re getting the bulk discount on repeated content.

  Cache doesn’t reduce what you send. It reduces what you pay per token for sending it.
