A 22-staff recruitment firm builds an AI candidate-screening assistant. Each conversation runs thirty to fifty turns. Gather background, probe experience, walk through scenarios, draft a recommendation. Week one looks great. The model is sharp, the conversations feel natural, the cost-per-screen is small enough to ignore. Week six the bill triples and nobody can immediately say why.
The why is in the loop. By turn thirty, the assistant is sending around 50,000 tokens of accumulated history with every new question, because the team built the simplest possible thing: re-ship the whole transcript on each call. The cost curve is not the only issue. The privacy policy does not name AI logging, retention is indefinite, and the conversation logs contain candidate personal data. Both problems sit one architectural decision away. The decision has a name.
What is conversation history management?
Conversation history management is the layer that decides what to send to the model on each turn, what to summarise, what to drop, and where to store it. Large language models are stateless. They do not remember earlier exchanges. The application maintains the illusion of memory by capturing each message, retrieving relevant prior turns from storage, and packaging that bundle with the new question before calling the model.
That cycle runs silently behind every multi-turn chat. From the user’s seat, the conversation flows. Behind the scenes there is no memory, only repeated reminders, and somebody is paying for every token of that reminder on every turn.
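The repeated-reminder loop can be sketched in a few lines. This is an illustrative skeleton, not any vendor's API: `call_model` is a placeholder for whatever chat-completion endpoint the application uses. The point is structural, the full transcript travels with every request.

```python
def call_model(messages):
    # Stand-in for a real chat-completion API call; returns a canned reply.
    return f"reply to: {messages[-1]['content']}"

# The "memory" is just a list the application keeps and re-sends.
history = [{"role": "system", "content": "You are a screening assistant."}]

def ask(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # the WHOLE transcript ships on every turn
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Tell me about your last role.")
ask("What did you build there?")
# After two turns the next request already carries five messages,
# and the list only grows from here.
```

Nothing in this loop remembers anything; deleting `history` between calls would make the model a stranger on every turn.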
Vendors implement the layer differently. OpenAI’s Threads store the transcript on their servers and you reference an ID. Anthropic’s Conversation object hands the application explicit control. Google Dialogflow CX persists state in Sessions for thirty minutes by default. The architectural choice is not academic. It decides where your data lives, who is responsible for retention, and how predictable your monthly bill becomes once conversations stretch.
Why the cost curve bends quadratically and quality degrades
In a naive loop, each turn ships the whole transcript again. A measured 10-step coding agent consumed 472,500 input tokens against 9,000 for a single-pass equivalent, a multiplier of more than fifty on the same business outcome. Input tokens dominate spend in agent loops, accounting for around 54 per cent of total cost, precisely because historical context is being rebilled on every turn rather than computed fresh.
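The quadratic shape is easy to reproduce with back-of-envelope arithmetic. The 1,500-tokens-per-turn figure below is an illustrative assumption, not a number from the measured agent above; the shape, not the scale, is the point.

```python
def cumulative_input_tokens(turns, per_turn=1500):
    """Total input tokens billed when the whole transcript is re-sent each turn."""
    total = 0
    transcript = 0
    for _ in range(turns):
        transcript += per_turn  # history grows linearly...
        total += transcript     # ...but is rebilled in full on every call
    return total

# Naive loop vs a baseline where each turn's content is sent once:
print(cumulative_input_tokens(10))  # → 82500
print(10 * 1500)                    # → 15000
```

Because each turn rebills everything before it, the total is proportional to the square of the turn count: double the conversation length and the input bill roughly quadruples.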
Quality follows the same shape. Research from Microsoft and Salesforce in May 2025 found multi-turn performance dropped 39 per cent on average against single-turn baselines. Worse, when a model makes an early mistake it tends to double down rather than recover. By turn ten the response can be twice as long as it should be and fundamentally misaligned with what the user asked. The remedy is better-curated history rather than more of it. Vendors implementing intelligent memory systems report 26 per cent better response quality alongside 80 to 90 per cent token reduction, which suggests the two metrics move together rather than trading off.
Where you will meet it in the wild
You will meet conversation history management on every multi-turn AI vendor’s pricing or architecture page, dressed in different clothes. OpenAI’s Assistants API uses Threads, server-side conversation containers where you reference an ID and the platform handles compression. Anthropic’s Conversation object lets your application manage messages directly while providing the plumbing. Google Dialogflow CX persists state in Sessions, defaulting to thirty minutes and extendable to twenty-four hours.
Microsoft Agent Framework offers two storage models, service-managed (state on the vendor’s side, you reference an ID) and client-managed (you keep the history locally and ship it on each request). Emerging platforms like Mem0 sit on top, extracting key facts from conversations and discarding raw transcripts. Each pattern moves the cost-versus-control trade-off to a different point. The procurement question worth asking is “where does my history live, how long is it kept, can I delete it on demand, and who can access it?” If the vendor cannot answer cleanly, treat that as the answer.
Thread leakage is the failure mode owners rarely hear about until it bites someone else. In March 2023 OpenAI suffered a glitch that exposed conversation titles to other users’ browser sidebars. The content was not visible, but a title like “Medical diagnosis discussion” or “Legal advice on contract” is itself sensitive metadata. For SMEs running multi-tenant platforms, the diligence question is how your conversation data is isolated from other customers’ data, what access controls govern who can retrieve it, and whether the vendor offers UK data residency or encryption at rest. If the vendor offers vague reassurance instead of documentation, the risk is real.
When to plan compaction versus run a window
For short transactional conversations of three to ten turns, a sliding window is usually fine. Keep the last N messages, drop the rest, accept that the AI will lose context from the start of long conversations. For sales conversations, multi-day research, or anything spanning sessions, plan for summarisation or vector retrieval from day one. For high-volume customer-facing chatbots, hierarchical memory or an intelligent-memory layer earns its complexity within months, not years.
The five compaction strategies in plain English: sliding window keeps the last N messages, fast and limited; token truncation cuts based on a token count, same limitation, smarter at the boundary; summarisation compresses earlier turns into a paragraph, retains gist at the cost of nuance; vector retrieval stores past messages as embeddings and pulls only the relevant ones into each new turn; hierarchical memory organises into layers (domain summary, category summary, specific traces) and traverses top-down.
For any conversation containing personal data, the retention schedule is a day-zero requirement, not a retrofit. GDPR Article 5(1)(e) prohibits indefinite retention. You need a documented schedule, automated deletion, and explicit user consent at session start. Two-party consent jurisdictions add a separate exposure. Indefinite chat logs read as conservative practice and are in fact non-compliant.
Related concepts
Conversation history shares its budget with everything else in the context window. Once history grows past the effective context size of the model, accuracy degrades quietly through the “lost in the middle” effect, regardless of how the application slices the transcript.
Vector retrieval is built on embeddings, the numerical representation of text that lets a system find semantically similar passages. RAG-style history management is essentially embedding-driven retrieval applied to your conversation log instead of your knowledge base.
Prompt caching is the lever that bends agent-loop economics back towards linear. The system prompt and tool definitions are usually identical across every turn, and caching them at a 90 per cent discount on subsequent reads is often the single biggest cost lever an SME can pull without changing architecture.
Output economics matter alongside history. Input versus output tokens are billed at different rates, and concise structured outputs are materially cheaper than narrative summaries. The honest framing for any owner sitting opposite a vendor pitch is this. The headline pricing-per-token is half the story. The other half is what the vendor’s history-management strategy does to your monthly bill once conversations stretch past turn fifteen.



