What is conversation history? Why your AI bill scales quadratically

TL;DR

Conversation history management is how an AI application maintains the illusion of memory across turns. Models are stateless, so each new request silently re-ships earlier exchanges. Naive loops scale quadratically: a 10-step agent can consume 472,500 input tokens against 9,000 for a single-pass equivalent. Quality degrades alongside cost, with multi-turn performance dropping 39 per cent against single-turn baselines. The fix is curated history, not more of it.

Key takeaways

- Large language models do not remember anything between turns. Every call to the model includes whatever history the application chose to ship, which means you are paying to re-send the same tokens repeatedly.
- Naive loops scale quadratically. A 10-step coding agent has been measured at 472,500 input tokens versus 9,000 for a single-pass equivalent, a 43x multiplier on the same business outcome.
- Quality degrades alongside cost. Multi-turn performance drops 39 per cent on average versus single-turn, and models double down on early mistakes rather than recovering.
- Five compaction strategies cover the field: sliding window, token truncation, summarisation, vector retrieval, and hierarchical memory. Conversation length and stakes decide which earns its keep.
- Conversation logs containing customer data are personal data under GDPR Article 5(1)(e). Indefinite retention is non-compliant, not conservative. Documented schedule, automated deletion, and PII redaction are day-zero requirements.

A 22-staff recruitment firm builds an AI candidate-screening assistant. Each conversation runs thirty to fifty turns. Gather background, probe experience, walk through scenarios, draft a recommendation. Week one looks great. The model is sharp, the conversations feel natural, the cost-per-screen is small enough to ignore. Week six the bill triples and nobody can immediately say why.

The why is in the loop. By turn thirty, the assistant is sending around 50,000 tokens of accumulated history with every new question, because the team built the simplest possible thing: re-ship the whole transcript on each call. The cost curve is not the only issue. The privacy policy does not name AI logging, retention is indefinite, and the conversation logs contain candidate personal data. Both problems sit one architectural decision away. The decision has a name.

What is conversation history management?

Conversation history management is the layer that decides what to send to the model on each turn, what to summarise, what to drop, and where to store it. Large language models are stateless. They do not remember earlier exchanges. The application maintains the illusion of memory by capturing each message, retrieving relevant prior turns from storage, and packaging that bundle with the new question before calling the model.

That cycle runs silently behind every multi-turn chat. From the user’s seat, the conversation flows. Behind the scenes there is no memory, only repeated reminders, and somebody is paying for every token of that reminder on every turn.
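The cycle can be sketched in a few lines. This is a minimal illustration of the naive pattern, not any vendor's SDK; `call_model` is a placeholder for a real chat-completion API.

```python
# Minimal sketch of the capture-store-retrieve cycle behind a naive
# multi-turn chat. `call_model` stands in for any chat-completion API.

history = []  # the application's only "memory"

def call_model(messages):
    # Placeholder for a real API call; returns a canned reply here.
    return f"(reply to: {messages[-1]['content']})"

def ask(user_message):
    # 1. Capture the new message.
    history.append({"role": "user", "content": user_message})
    # 2. Retrieve: the naive strategy ships the WHOLE transcript.
    bundle = list(history)
    # 3. Package and call the stateless model.
    reply = call_model(bundle)
    history.append({"role": "assistant", "content": reply})
    return reply

ask("What is our refund policy?")
ask("Does it cover digital goods?")
# By the second call, the first exchange was silently re-sent.
assert len(history) == 4
```

Every turn, the `bundle` grows; the model sees a fresh copy of everything, and the bill reflects it.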

Vendors implement the layer differently. OpenAI’s Threads store the transcript on their servers and you reference an ID. Anthropic’s Conversation object hands the application explicit control. Google Dialogflow CX persists state in Sessions for thirty minutes by default. The architectural choice is not academic. It decides where your data lives, who is responsible for retention, and how predictable your monthly bill becomes once conversations stretch.

Why the cost curve bends quadratically and quality degrades

In a naive loop, each turn ships the whole transcript again. A measured 10-step coding agent consumed 472,500 input tokens against 9,000 for a single-pass equivalent, a 43x multiplier on the same business outcome. Input tokens dominate spend in agent loops, accounting for around 54 per cent of total cost, precisely because historical context is being rebilled on every turn rather than computed fresh.
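The quadratic shape falls out of simple arithmetic. The sketch below uses an illustrative per-turn token figure, not the measured numbers above (which include tool outputs and system prompts), but the curve is the same.

```python
# Back-of-envelope sketch of why re-shipping the transcript scales
# quadratically. `per_turn` is an illustrative figure, not a measurement.

per_turn = 1_500  # tokens added to the transcript per turn

def total_input_tokens(turns):
    # Turn k re-sends everything accumulated so far: k * per_turn tokens.
    return sum(k * per_turn for k in range(1, turns + 1))

linear = 10 * per_turn              # what a single-pass equivalent pays
quadratic = total_input_tokens(10)  # what the naive loop pays

print(linear, quadratic)  # 15000 vs 82500 at just 10 turns
assert quadratic == per_turn * 10 * 11 // 2
```

At thirty turns the same formula gives 465 tokens-worth of turns for every one a single pass pays, which is why the recruitment firm's bill tripled between week one and week six.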

Quality follows the same shape. Research from Microsoft and Salesforce in May 2025 found multi-turn performance dropped 39 per cent on average against single-turn baselines. Worse, when a model makes an early mistake it tends to double down rather than recover. By turn ten the response can be twice as long as it should be and fundamentally misaligned with what the user asked. The remedy is better-curated history rather than more of it. Vendors implementing intelligent memory systems report 26 per cent better response quality alongside 80 to 90 per cent token reduction, which suggests the two metrics move together rather than trading off.

Where you will meet it in the wild

You will meet conversation history management on every multi-turn AI vendor’s pricing or architecture page, dressed in different clothes. OpenAI’s Assistants API uses Threads, server-side conversation containers where you reference an ID and the platform handles compression. Anthropic’s Conversation object lets your application manage messages directly while providing the plumbing. Google Dialogflow CX persists state in Sessions, defaulting to thirty minutes and extendable to twenty-four hours.

Microsoft Agent Framework offers two storage models, service-managed (state on the vendor’s side, you reference an ID) and client-managed (you keep the history locally and ship it on each request). Emerging platforms like Mem0 sit on top, extracting key facts from conversations and discarding raw transcripts. Each pattern moves the cost-versus-control trade-off to a different point. The procurement question worth asking is “where does my history live, how long is it kept, can I delete it on demand, and who can access it?” If the vendor cannot answer cleanly, treat that as the answer.

Thread leakage is the failure mode owners rarely hear about until it bites someone else. In March 2023 OpenAI suffered a glitch that exposed conversation titles to other users’ browser sidebars. The content was not visible, but a title like “Medical diagnosis discussion” or “Legal advice on contract” is itself sensitive metadata. For SMEs running multi-tenant platforms, the diligence question is how your conversation data is isolated from other customers’ data, what access controls govern who can retrieve it, and whether the vendor offers UK data residency or encryption at rest. If the vendor offers vague reassurance instead of documentation, the risk is real.

When to plan compaction versus run a window

For short transactional conversations of three to ten turns, a sliding window is usually fine. Keep the last N messages, drop the rest, accept that the AI will lose context from the start of long conversations. For sales conversations, multi-day research, or anything spanning sessions, plan for summarisation or vector retrieval from day one. For high-volume customer-facing chatbots, hierarchical memory or an intelligent-memory layer earns its complexity within months, not years.

The five compaction strategies in plain English:

- Sliding window keeps the last N messages. Fast and limited.
- Token truncation cuts based on a token count. Same limitation, smarter at the boundary.
- Summarisation compresses earlier turns into a paragraph, retaining gist at the cost of nuance.
- Vector retrieval stores past messages as embeddings and pulls only the relevant ones into each new turn.
- Hierarchical memory organises into layers (domain summary, category summary, specific traces) and traverses top-down.
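The two simplest strategies fit in a few lines each. These are illustrative sketches; real systems count tokens with the model's tokenizer rather than a whitespace split.

```python
# Sketches of the two simplest compaction strategies. Token counts use
# a crude whitespace split; real systems use the model's tokenizer.

def sliding_window(history, n=6):
    # Keep only the last n messages, drop everything older.
    return history[-n:]

def token_truncate(history, budget=200):
    # Walk backwards from the newest message, keeping whole messages
    # until the token budget is spent.
    kept, used = [], 0
    for msg in reversed(history):
        cost = len(msg["content"].split())
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [{"role": "user", "content": f"message {i} " * 20} for i in range(10)]
assert len(sliding_window(history)) == 6
assert sum(len(m["content"].split()) for m in token_truncate(history)) <= 200
```

Both lose everything outside the window, which is exactly why longer or higher-stakes conversations graduate to summarisation or retrieval.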

For any conversation containing personal data, the retention schedule is a day-zero requirement, not a retrofit. GDPR Article 5(1)(e) prohibits indefinite retention. You need a documented schedule, automated deletion, and explicit user consent at session start. Two-party consent jurisdictions add a separate exposure. Indefinite chat logs read as conservative practice and are in fact non-compliant.
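Automated deletion against a documented schedule can be as simple as the sketch below. The storage shape and field names are assumptions for illustration; the point is that the retention period lives in one place and a scheduled job enforces it.

```python
# Hedged sketch of an automated retention schedule: drop any stored
# conversation older than the documented retention period. The list-of-
# dicts storage and field names are assumptions for illustration.

from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # must match your documented schedule

def purge_expired(conversations, now=None):
    now = now or datetime.now(timezone.utc)
    cutoff = now - RETENTION
    # Keep only conversations last touched inside the retention window.
    return [c for c in conversations if c["last_active"] >= cutoff]

now = datetime.now(timezone.utc)
logs = [
    {"id": "a", "last_active": now - timedelta(days=10)},
    {"id": "b", "last_active": now - timedelta(days=120)},
]
assert [c["id"] for c in purge_expired(logs, now)] == ["a"]
```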

Conversation history shares its budget with everything else in the context window. Once history grows past the effective context size of the model, accuracy degrades quietly through the “lost in the middle” effect, regardless of how the application slices the transcript.

Vector retrieval is built on embeddings, the numerical representation of text that lets a system find semantically similar passages. RAG-style history management is essentially embedding-driven retrieval applied to your conversation log instead of your knowledge base.
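The retrieval shape can be shown with a toy example. The bag-of-words "embedding" below is a deliberate stand-in for a real embedding model; only the store-rank-retrieve pattern matters.

```python
# Toy sketch of vector retrieval over a conversation log. The
# bag-of-words "embedding" stands in for a real embedding model.

import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

log = [
    "The customer asked about refund timelines",
    "We discussed shipping to Northern Ireland",
    "Refunds are processed within 14 days",
]
vectors = [(msg, embed(msg)) for msg in log]

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(vectors, key=lambda mv: cosine(q, mv[1]), reverse=True)
    return [msg for msg, _ in ranked[:k]]

top = retrieve("shipping options to northern ireland")
assert top == ["We discussed shipping to Northern Ireland"]
```

Only the retrieved messages get packaged into the next model call; the rest of the log stays in storage, unbilled.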

Prompt caching is the lever that bends agent-loop economics back towards linear. The system prompt and tool definitions are usually identical across every turn, and caching them at a 90 per cent discount on subsequent reads is often the single biggest cost lever an SME can pull without changing architecture.
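The arithmetic behind that claim, using made-up prices and token counts and assuming cached input reads are billed at 10 per cent of the normal rate:

```python
# Illustrative arithmetic for prompt caching, assuming cached input
# tokens bill at 10% of the normal rate. Prices and token counts are
# invented for the sketch, not quoted from any vendor.

price_per_mtok = 3.00   # $ per million input tokens, assumed
static_prefix = 8_000   # system prompt + tool definitions, identical each turn
turns = 30

def cost(tokens):
    return tokens * price_per_mtok / 1_000_000

uncached = cost(static_prefix * turns)
cached = cost(static_prefix) + cost(static_prefix * (turns - 1)) * 0.10

print(f"uncached ${uncached:.2f} vs cached ${cached:.4f}")
assert cached < uncached * 0.15
```

The static prefix is the same bytes on every turn, so caching it converts a per-turn cost into roughly a one-off cost plus a small read fee.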

Output economics matter alongside history. Input and output tokens are billed at different rates, and concise structured outputs are materially cheaper than narrative summaries. The honest framing for any owner sitting opposite a vendor pitch is this: the headline price per token is half the story. The other half is what the vendor's history-management strategy does to your monthly bill once conversations stretch past turn fifteen.

Sources

- Drainpipe (2026). What is conversation history management? The canonical plain-English introduction to CHM and the three-step capture-store-retrieve cycle. https://drainpipe.io/knowledge-base/what-is-conversation-history-management/
- Laban et al. (2025). LLMs Get Lost in Multi-Turn Conversation. Microsoft Research and Salesforce paper measuring the 39 per cent multi-turn performance drop. https://arxiv.org/abs/2505.06120
- Augment Code (2025). AI agent loop token cost and context constraints. Source for the 472,500 versus 9,000 input-token measurement on a 10-step agent loop. https://www.augmentcode.com/guides/ai-agent-loop-token-cost-context-constraints
- Mem0 (2025). LLM chat history summarisation strategies. Industry analysis of compaction strategies and the 26 per cent quality improvement alongside 80 to 90 per cent token reduction. https://mem0.ai/blog/llm-chat-history-summarization-guide-2025
- Microsoft (2025). Chat history storage patterns in Microsoft Agent Framework. Service-managed, client-managed, and hybrid storage models documented by the framework team. https://devblogs.microsoft.com/agent-framework/chat-history-storage-patterns-in-microsoft-agent-framework/
- Google (2026). Dialogflow CX session documentation. Default 30-minute session persistence and TTL configuration for managed conversation state. https://docs.cloud.google.com/dialogflow/cx/docs/concept/session
- Information Commissioner's Office (2024). Guidance on GDPR storage limitation principle. The UK regulator on Article 5(1)(e) retention requirements. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-protection-principles/a-guide-to-the-data-protection-principles/the-principles/storage-limitation/
- OpenAI (2025). Assistants API deep dive. Vendor documentation of Threads as server-side conversation containers and automatic truncation behaviour. https://developers.openai.com/api/docs/assistants/deep-dive

Frequently asked questions

How long before naive history management becomes a real cost problem?

For short transactional chats of three to ten turns, a naive loop is fine. Costs become visible at around turn fifteen and break the unit economics by turn thirty, which is roughly month two for a high-volume customer chatbot. If your conversations regularly cross fifteen turns or span sessions, plan for compaction from day one rather than retrofitting it after the bill arrives.

What is the simplest compaction strategy I can ask my vendor about?

Sliding window is the simplest. The application keeps the last N messages and drops the rest. It is fast, predictable, and adequate for transactional support where recent context is what matters. Once conversations need to remember earlier facts, summarisation or vector retrieval earn their keep. Hierarchical memory is the most sophisticated and only justifies its complexity at scale.

Is storing conversation history a GDPR issue?

Yes, if any of it contains personal data. Names, email addresses, order numbers, even inferred preferences are in scope. Article 5(1)(e) prohibits indefinite retention. You need a documented retention schedule tied to a business purpose, automated deletion that enforces it, and a privacy notice that names AI logging explicitly. PII redaction before storage is now table stakes; verify it is enabled by default.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
