How agent summarization works (LangChain + Deep Agents)
Modern agents keep a chat history of user messages, assistant replies, and tool calls. That history is bounded by a context window. When it gets long, frameworks can compress older turns so the model still has useful signal without sending megabytes of verbatim text. This post walks through how that works in LangChain v1 and how Deep Agents extends it.
The problem in one sentence
You have more conversation than fits comfortably in context; you want to drop or shrink the old part while keeping recent turns intact and not breaking tool-call structure.
Two layers: stock LangChain vs Deep Agents
LangChain ships a SummarizationMiddleware that implements the core idea: measure size → pick a cut → summarize the prefix → replace state with summary + suffix.
Deep Agents adds its own module with the same name but different, application-level behavior: deepagents/.../middleware/summarization.py. It wraps LangChain’s logic (using SummarizationMiddleware as an internal helper), then adds backend offload, optional truncation of huge tool arguments, wrap_model-style integration, and an optional compact_conversation tool.
If you read only LangChain’s file, you will not find disk offload; that lives in Deep Agents.
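For the stock layer, wiring it up is just another middleware on the agent. A minimal sketch, assuming LangChain v1's create_agent; the constructor arguments shown for SummarizationMiddleware are illustrative, since the trigger and retention options vary by version:

```python
from langchain.agents import create_agent
from langchain.agents.middleware import SummarizationMiddleware

# Illustrative parameter names; check your LangChain version for the exact
# trigger/retention options the middleware accepts.
summarizer = SummarizationMiddleware(
    model="openai:gpt-4o-mini",       # a cheaper model can do the compression
    max_tokens_before_summary=4000,   # trigger: compact once history exceeds this
    messages_to_keep=20,              # retention: keep the newest turns verbatim
)

agent = create_agent(
    model="openai:gpt-4o",
    tools=[],                         # your tools here
    middleware=[summarizer],
)
```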
When does summarization run?
Both stacks use configurable triggers, expressed as things like:
- Messages: “Run when there are at least N messages.”
- Tokens: “Run when approximate (or sometimes reported) token count ≥ T.”
- Fraction: “Run when usage ≥ f × max_input_tokens” (requires a model profile that defines max_input_tokens).
Deep Agents’ create_summarization_middleware picks defaults from the model profile (for example ~85% of the window as trigger and ~10% as retention when profile data exists).
There is also a practical escape hatch in LangChain: if the normal model call hits a context-overflow error, the pipeline can force a summarization pass even when the counters had not quite reached the trigger.
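To make the three trigger styles concrete, here is a minimal sketch; should_summarize, approx_tokens, and the ModelProfile stand-in are hypothetical names for illustration, not the framework's API:

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    max_input_tokens: int | None = None  # illustrative stand-in for a model profile

def approx_tokens(messages: list[str]) -> int:
    # Crude approximation: ~4 characters per token. Real implementations use
    # the model's reported usage or a tokenizer when one is available.
    return sum(len(m) for m in messages) // 4

def should_summarize(
    messages: list[str],
    profile: ModelProfile,
    max_messages: int | None = None,    # "at least N messages"
    max_tokens: int | None = None,      # "token count >= T"
    max_fraction: float | None = None,  # "usage >= f * max_input_tokens"
) -> bool:
    if max_messages is not None and len(messages) >= max_messages:
        return True
    tokens = approx_tokens(messages)
    if max_tokens is not None and tokens >= max_tokens:
        return True
    if max_fraction is not None and profile.max_input_tokens:
        return tokens >= max_fraction * profile.max_input_tokens
    return False
```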
“Where to cut” — the keep window
Summarization is not “summarize the entire thread.” It is “summarize the head, keep the tail.”
The framework computes a cutoff index in the message list:
- Everything before the cutoff → candidate for summarization (the “old” prefix).
- Everything from the cutoff onward → kept verbatim (the “recent” window).
Retention (keep) can be defined as the last N messages, the last T tokens, or the last fraction of the max input window. The implementation walks or binary-searches the message list so the kept suffix fits the policy.
Important detail: The cutoff is adjusted so you do not split an AIMessage that issued tool_calls from the ToolMessage responses that belong to it. Otherwise the model would see orphaned tool results or calls without replies.
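A sketch of that cutoff adjustment for a message-count retention policy (find_cutoff is a hypothetical helper; real implementations also handle token- and fraction-based retention by walking or binary-searching for the boundary):

```python
def find_cutoff(messages: list[dict], keep_last: int) -> int:
    """Return the index where the kept suffix starts: messages[:cutoff] are
    summarization candidates, messages[cutoff:] are kept verbatim."""
    cutoff = max(0, len(messages) - keep_last)
    # Never split an AIMessage that issued tool_calls from the ToolMessages
    # that answer it: if the cutoff lands on a tool result, walk it back until
    # it sits on the AI message that produced those results.
    while cutoff > 0 and messages[cutoff].get("role") == "tool":
        cutoff -= 1
    return cutoff
```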
How the summary is produced
The messages in the prefix (optionally trimmed to a token budget for the summarizer call, e.g. a few thousand tokens) are flattened to plain text (e.g. with get_buffer_string) and passed to a separate LLM call with a structured prompt (LangChain’s default prompt asks for sections such as intent, summary, artifacts, and next steps). The result is a single new HumanMessage that stands in for the whole prefix.
So: one LLM call (or async equivalent) compresses many messages into one.
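In spirit (not the exact library code), that call looks like this; summarize_prefix and the prompt text are illustrative, and LangChain's real default prompt is more detailed:

```python
from langchain_core.messages import HumanMessage, get_buffer_string

SUMMARY_PROMPT = """Summarize the conversation below. Cover:
- the user's intent
- what has been done so far
- artifacts produced (files, code, decisions)
- open questions and next steps

Conversation:
{conversation}"""

def summarize_prefix(model, prefix_messages) -> HumanMessage:
    # Flatten the old messages to plain text; a real implementation may first
    # trim this to a token budget for the summarizer call.
    conversation = get_buffer_string(prefix_messages)
    summary = model.invoke(SUMMARY_PROMPT.format(conversation=conversation))
    # One HumanMessage now stands in for the entire prefix.
    return HumanMessage(content=f"Summary of earlier conversation:\n{summary.content}")
```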
What LangChain does with state (the simple story)
In the classic LangChain path, the middleware returns an update that removes the old messages and inserts the summary plus the preserved tail (using LangGraph’s RemoveMessage machinery). The evicted text is not written to disk by that middleware alone; it is simply gone from the active message list unless something else (logging, checkpoints, custom code) kept it.
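The resulting update has roughly this shape; this is a sketch of the pattern rather than the middleware's exact code, with REMOVE_ALL_MESSAGES as the LangGraph sentinel that clears the list before re-inserting:

```python
from langchain_core.messages import RemoveMessage
from langgraph.graph.message import REMOVE_ALL_MESSAGES

def build_state_update(summary_message, preserved_tail):
    # Clearing the whole list and re-inserting is one common pattern; another
    # is emitting a RemoveMessage per evicted id. Either way, the evicted text
    # is gone from live state unless a checkpoint or log kept it.
    return {
        "messages": [
            RemoveMessage(id=REMOVE_ALL_MESSAGES),
            summary_message,
            *preserved_tail,
        ]
    }
```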
What Deep Agents adds
1. Offload to a backend
When the prefix is about to be summarized, Deep Agents can _offload_to_backend: serialize those messages to markdown and append them to a per-thread file such as:
{artifacts_root}/conversation_history/{thread_id}.md
The backend is pluggable (filesystem, composite routes, sandboxes, etc.) behind a small protocol (read / write / edit / download_files, …). So recovery of raw history is possible from disk (or whatever backend), even though the live chat is compacted.
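A sketch of that offload under an assumed minimal backend protocol; Backend and offload_to_backend here are illustrative stand-ins, not Deep Agents' actual interfaces:

```python
from typing import Protocol
from langchain_core.messages import get_buffer_string

class Backend(Protocol):
    """Stand-in for a pluggable storage backend (filesystem, sandbox, ...)."""
    def read(self, path: str) -> str: ...
    def write(self, path: str, content: str) -> None: ...

def offload_to_backend(backend: Backend, thread_id: str, prefix_messages,
                       artifacts_root: str = "artifacts") -> str:
    path = f"{artifacts_root}/conversation_history/{thread_id}.md"
    try:
        existing = backend.read(path)
    except FileNotFoundError:
        existing = ""
    # Append the evicted messages as markdown so the raw history stays
    # recoverable even though the live chat is compacted.
    archive_entry = "\n\n## Compacted messages\n\n" + get_buffer_string(prefix_messages)
    backend.write(path, existing + archive_entry)
    return path
```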
2. Summary text may mention the path
If offload succeeds, the summary HumanMessage can explicitly say that full history was saved at that path, so the agent (or user) knows where to look. If offload fails, summarization may still proceed, but without that pointer.
3. Logical compaction via _summarization_event
The Deep Agents automatic path often uses wrap_model_call: rather than deleting every old message from graph state, it records a _summarization_event (cutoff, summary message, optional file path) and builds the message list actually sent to the model from the raw state plus that event. That keeps compaction integrated with the rest of the agent stack and its tooling.
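Conceptually, the raw state plus the recorded event produce the view the model actually sees; SummarizationEvent and effective_messages below are illustrative names for that idea:

```python
from dataclasses import dataclass
from langchain_core.messages import AnyMessage, HumanMessage

@dataclass
class SummarizationEvent:
    cutoff: int                       # index into the raw message list
    summary: HumanMessage             # stands in for messages[:cutoff]
    offload_path: str | None = None   # where the raw prefix was archived, if anywhere

def effective_messages(raw: list[AnyMessage], event: SummarizationEvent | None):
    # Raw state stays intact; the compacted view is computed at call time.
    if event is None:
        return raw
    return [event.summary, *raw[event.cutoff:]]
```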
4. Truncating huge tool arguments (optional)
Before or alongside full summarization, Deep Agents can shorten very long string arguments on older write_file / edit_file tool calls in the history, because the full payload may already live on disk. That is a cheap win separate from LLM summarization.
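Roughly, the trimming looks like this; the tool names come from the source, but truncate_old_tool_args and the 2,000-character cap are arbitrary choices for the sketch:

```python
def truncate_old_tool_args(messages, tools=("write_file", "edit_file"), max_chars=2000):
    """Shorten very long string arguments on older tool calls in the history."""
    for msg in messages:
        for call in getattr(msg, "tool_calls", None) or []:
            if call.get("name") not in tools:
                continue
            for key, value in call.get("args", {}).items():
                if isinstance(value, str) and len(value) > max_chars:
                    # The full payload usually already lives on disk, so a stub
                    # plus a marker loses little information.
                    call["args"][key] = value[:max_chars] + "\n…[truncated]"
    return messages
```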
5. Manual compact_conversation
SummarizationToolMiddleware exposes a tool so the model (or CLI) can compact on demand, with an eligibility gate (roughly “context should be moderately full”) so the tool is not abused when the thread is still tiny.
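The gate itself is just a cheap pre-check before running the same compaction path; eligible_to_compact and the one-third floor below are illustrative, not Deep Agents' actual threshold:

```python
def eligible_to_compact(current_tokens: int, max_input_tokens: int,
                        min_fraction: float = 0.3) -> tuple[bool, str]:
    # Refuse on-demand compaction while the thread is still tiny: summarizing
    # a near-empty context costs an LLM call and frees almost nothing.
    if current_tokens < min_fraction * max_input_tokens:
        return False, "Context is not full enough to be worth compacting yet."
    return True, "ok"
```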
Second and later compactions
The same {thread_id}.md file is reused; each compaction appends a new timestamped section. The effective chat view gets a new summary message again; the markdown log becomes a running archive of what was evicted each time.
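Building on the hypothetical offload helper above, making the archive a running log is just a matter of stamping each appended section:

```python
from datetime import datetime, timezone

def compaction_section(markdown_body: str) -> str:
    # Each compaction appends a dated section, so {thread_id}.md reads as a
    # chronological archive of everything that was evicted and when.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"\n\n## Compacted at {stamp}\n\n{markdown_body}"
```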
What does not happen automatically
Nothing in this design automatically reads the offload file back into the model context. The usual expectation is: the model is told the path, and if it needs detail, it uses normal file tools to read it. Summarization reduces context by design; re-hydrating everything would defeat that unless you explicitly want it.