Prompt Caching for LLM Apps: A Practical Guide (API-Based)
If you’re building an LLM-powered application (like a code review system), you’ve probably noticed this pattern:
- A large, mostly static prompt (system instructions, guidelines, skill files)
- A small dynamic input (e.g., code diff)
Yet every request reprocesses the entire prompt. That’s wasteful.
Prompt caching fixes this.
This guide explains:
- What prompt caching actually is
- When it works
- How to implement it with APIs
- Common pitfalls
- A concrete design for a code-review app
🧠 What is Prompt Caching (Really)?
When an LLM processes input, it:
- Tokenizes text
- Runs it through transformer layers
- Builds internal attention states (KV cache)
👉 Prompt caching = reusing those internal states for repeated prefixes
So instead of recomputing:
[system + guidelines + skills + diff]
every time, you compute:
[system + guidelines + skills] → cached
and reuse it for:
+ diff
🎯 When Prompt Caching Helps
You get the biggest gains when:
- Large static prefix
- Small dynamic suffix
- High request volume
Example (code review)
System prompt: 2000 tokens
Guidelines: 3000 tokens
Skill files: 5000 tokens
Code diff: 200 tokens
--------------------------------
Total: ~10,200 tokens
Without caching → all ~10,200 tokens are processed on every request
With caching → only the ~200-token diff is newly processed after the first request; the ~10,000-token prefix is read from cache
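To put rough numbers on it (the cache-read discount below is illustrative; each provider prices cached tokens differently):

First request:       ~10,200 tokens processed, prefix written to the cache
Every later request: ~200 new tokens + ~10,000 cached tokens
If cached tokens are billed at, say, 10% of the normal input rate:
  10,000 × 0.10 + 200 ≈ 1,200 billable token-equivalents per request
  vs. ~10,200 without caching → roughly an 85–90% reduction in input cost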
⚡ Benefits
🚀 Latency
Skip most of the computation → faster responses
💰 Cost
Many providers bill cached (cache-read) input tokens at a steep discount compared with regular input tokens, so the repeated prefix costs far less per request
📈 Throughput
Higher QPS (requests/sec)
🧩 Key Rule (Most Important)
👉 Caching only works on a strict prefix
Correct:
[STATIC PREFIX]
[STATIC PREFIX]
----------------
[DYNAMIC INPUT]
Wrong:
[STATIC]
[DYNAMIC]
[STATIC AGAIN] ❌ only the leading static block can be cached; everything after the dynamic part is recomputed every time
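In code, the rule looks like this (a minimal sketch; STATIC_PREFIX, requestId and diff are hypothetical variables): everything that changes per request must come after everything that does not.

// ✔️ Cacheable: the shared prefix is byte-for-byte identical on every request.
String good = STATIC_PREFIX + "\n\nReview this code diff:\n" + diff;

// ❌ Mostly uncacheable: a per-request value appears before the static content,
// so the reusable prefix ends after just a few characters.
String bad = "Request " + requestId + "\n" + STATIC_PREFIX
        + "\n\nReview this code diff:\n" + diff;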
🏗️ API-Based Prompt Caching (How to Implement)
Provider support differs. Anthropic's Messages API uses explicit cache_control breakpoints to mark the cacheable prefix, while OpenAI applies prefix caching automatically to sufficiently long, identical prefixes (no hint required). The example below uses the explicit-breakpoint style.
Step 1 — Split your prompt
String staticPrompt = """
You are a senior code reviewer.
Follow these rules:
...
[guidelines]
...
[skills]
""";
String dynamicPrompt = """
Review this code diff:
""" + diff;
Step 2 — Send as structured input
// Anthropic-style Messages API call: the cache_control block marks the end of
// the cacheable prefix. (With OpenAI you would omit cache_control entirely;
// its prefix caching is applied automatically to long, identical prefixes.)
response = client.messages.create({
  model: "claude-sonnet-4-20250514",   // any model with prompt-caching support
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: staticPrompt,
      cache_control: { type: "ephemeral" }   // 👈 enables caching of this prefix
    }
  ],
  messages: [
    { role: "user", content: dynamicPrompt }
  ]
});
Step 3 — What happens internally
- First request:
- staticPrompt → processed and cached
- Next requests:
- the same staticPrompt → reused from the cache
- only dynamicPrompt is newly processed
Most providers report cache reads in the response's usage metadata, so you can verify the cache is actually being hit.
🧠 Important: Cache Scope
✔️ What does NOT matter
- Client object reuse
- Thread
- Instance
✔️ What DOES matter
- Exact prompt text
- Same model
- Same structure
👉 The cache lives server-side, not in your app. It is also short-lived: providers typically evict an unused prefix after minutes, not days.
⚠️ Common Pitfalls
1. Tiny changes break cache
"Review code\n" vs "Review code \n"
👉 Even whitespace differences → cache miss
2. Dynamic data inside static block
String staticPrompt = "Time: " + now + guidelines;
❌ breaks caching; put per-request data in the dynamic suffix instead (see the sketch after this list)
3. Reordering sections
[guidelines][skills] vs [skills][guidelines]
❌ different prefix → no cache
4. Mixing dynamic content in the middle
[system][diff][guidelines] ❌
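A simple way to avoid pitfalls 2–4 is to build the prefix exactly once and treat it as immutable; everything request-specific goes in the suffix. A minimal Java sketch (the load* helpers, which also appear in the next section, stand in for your own loading code):

// Build the prefix once, at startup, and never rebuild it per request.
// Rebuilding invites accidental differences (timestamps, ordering, trailing
// whitespace) that silently turn into cache misses.
final class ReviewPrompts {
    static final String STATIC_PREFIX =
            (loadSystem() + "\n\n" + loadGuidelines() + "\n\n" + loadSkills())
                    .stripTrailing();   // normalize so whitespace cannot drift

    // Everything request-specific lives in the dynamic suffix.
    static String suffixFor(String diff) {
        return "Review this code diff:\n" + diff;
    }

    private static String loadSystem()     { return "..."; }  // placeholder
    private static String loadGuidelines() { return "..."; }  // placeholder
    private static String loadSkills()     { return "..."; }  // placeholder
}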
🧠 Advanced Design for Code Review Apps
1. Build a stable prefix
staticPrompt = loadSystem() + loadGuidelines() + loadSkills();
2. Hash it (optional, for tracking)
cacheKey = hash(staticPrompt);
3. Keep dynamic input minimal
dynamicPrompt = formatDiff(diff);
4. Call model
callLLM(staticPrompt, dynamicPrompt);
🧩 Scaling Strategy
Multi-repo systems
Different repos → different guidelines
👉 Cache per repo:
cacheKey = hash(repo + guidelines + skills)
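A sketch of per-repo prefix management in Java 17+ (hypothetical class and method names; loadGuidelines and loadSkills stand in for your own code). The hash is not sent to the provider; it is just a stable identifier for logging which prefix version a request used.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RepoPrefixes {
    private final Map<String, String> prefixByRepo = new ConcurrentHashMap<>();

    // Build the static prefix for a repo once and reuse it verbatim afterwards.
    String prefixFor(String repo) {
        return prefixByRepo.computeIfAbsent(repo,
                r -> loadGuidelines(r) + "\n\n" + loadSkills(r));
    }

    // cacheKey = hash(repo + guidelines + skills), as described above.
    String cacheKeyFor(String repo) {
        return sha256Hex(repo + "\n" + prefixFor(repo));
    }

    private static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // SHA-256 is always available
        }
    }

    // Placeholders for your own per-repo loading code.
    private static String loadGuidelines(String repo) { return "..."; }
    private static String loadSkills(String repo)     { return "..."; }
}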
Warm cache (optional)
For frequently used configs:
- Send a dummy request at startup
- Pre-populate cache
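A minimal warm-up sketch, reusing the hypothetical RepoPrefixes class above and a callLLM placeholder for your actual API call. Provider caches are typically short-lived, so warming only pays off if real traffic follows soon after startup.

import java.util.List;

// Send one cheap throwaway request per repo at startup. The tiny suffix keeps
// the request inexpensive while still writing the large prefix into the cache.
void warmCaches(List<String> repos, RepoPrefixes prefixes) {
    for (String repo : repos) {
        String prefix = prefixes.prefixFor(repo);
        callLLM(prefix, "Reply with the single word: ready.");  // callLLM is a placeholder
    }
}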
Combine with RAG
Instead of loading all skills:
- Retrieve only relevant ones
- Cache common combinations
📊 Expected Impact
| Scenario | Latency (illustrative, ~10K-token prefix) |
|---|---|
| No cache | 2–5 seconds |
| With cache | ~0.5–1.5 seconds |
🧠 Mental Model
👉 Prompt caching is NOT:
- Response caching ❌
- Storing outputs ❌
👉 It IS:
- Reusing internal computation ✔️
🏆 When NOT to Use It
- Highly dynamic prompts
- Small prompts (<1k tokens; many providers won't cache prefixes below roughly 1,024 tokens anyway)
- One-off requests
✅ Summary
Prompt caching is one of the highest ROI optimizations for LLM apps.
To implement it:
- Separate static vs dynamic prompt
- Ensure static is a strict prefix
- Use explicit cache hints (cache_control) where the provider requires them
- Keep the prefix byte-for-byte identical across requests
👉 For code review systems, it can reduce:
- Latency by ~70–90%
- Input-token cost substantially (depending on the provider's cached-token discount)
🚀 Final Thought
If your app repeatedly sends large prompts, not using prompt caching is like:
recompiling your entire codebase on every request
Fixing it is simple—and the payoff is huge.