AI summaries of long documents: what they're allowed to leave out, and what they never should

TL;DR

AI summarisation systematically under-weights small, technical, conditional and structural elements, such as indemnity carve-outs, auto-renewal clauses and liability caps, whilst over-weighting headline narrative. Research from the University of St Gallen, MIT Press and ACL Findings shows the pattern is structural, not a tool defect. Three content categories benefit from AI summaries at SME scale: long unstructured material, multi-document synthesis, and fast first-reads at zero stakes. Three categories require human verification against the original: legal text, financial schedules and regulatory documents.

Key takeaways

- AI summarisation systematically drops cross-references, defined terms, carve-outs, materiality qualifications and conditional language whilst keeping headline claims intact. The summary reads complete, which is the operational problem.
- The pattern is structural, not a defect of any one tool. Language models calculate statistical importance from training data, so headline concepts get weighted heavily and rare-but-decisive clauses get under-weighted. A University of St Gallen study found AI achieved 91.6% accuracy interpreting finance, yet two-thirds of its errors came from being unable to locate the relevant passage in the first place.
- In multi-document summarisation, hallucination grows sharply. ArXiv research found up to 75% of generated content in conversation-domain summaries was hallucinated, and GPT-3.5-Turbo and GPT-4o fabricated content 79% and 44% of the time respectively when asked to summarise information that did not exist in the source.
- Three SME use cases where AI summaries earn their place: orientation through long unstructured material, pattern-finding across many similar documents, and fast first-reads at zero stakes. Three where the summary alone is operationally dangerous: legal text, financial schedules, regulatory documents.
- The proportionate discipline has three steps. Summarise the summary back to yourself so you are explicit about what you think you learned. Name the categories likely to have been dropped: carve-outs, conditions, exceptions, cross-references. On anything material, run the original past a human who knows what they are reading.

She thought the AI had done the contract review for her. The summary was clear, well-structured, and read like a competent lawyer’s note. She signed. Three months later her commercial solicitor spent fifteen minutes on the original and flagged three things the summary had not mentioned. An indemnity carve-out that reversed the risk allocation in the supplier’s favour. An auto-renewal clause with a thirty-day notice window that had already lapsed. A liability cap at three months of fees against a contract that could expose her business to substantially more.

The summary was not wrong, exactly. It was incompletely right in a way that changed the deal. The pattern is common enough now to register in courts and in the insurance market, and it has a structural explanation rooted in how AI summarisation works.

What does AI summarisation reliably keep, drop, and misshape?

It keeps narrative structure and headline claims. It systematically drops cross-references between clauses, defined-term dependencies, materiality qualifications, conditional language, and small technical passages dense with low-frequency vocabulary. It misshapes precise legal terms into generic equivalents: "pending or threatened litigation" becomes "legal issues"; "vendor negligence except for customer misuse outside documented scope" becomes "third-party claims". The summary reads complete, which is the operational problem.

Abstractive summarisation is a generative task, not a retrieval task. The model calculates which words are statistically likely to come next, based on patterns in its training data. Headline concepts appear frequently in similar documents, so the model weights them heavily. Carve-outs, exceptions, and cross-references are statistically rare relative to those headline concepts, so they get under-weighted. A University of St Gallen study showed the consequence cleanly. AI analysing financial documents it had not seen during training reached 91.6 per cent accuracy on interpretation, beating human analysts at 82.8 per cent. Two-thirds of its errors came not from misunderstanding the finance but from failing to find the relevant information in the first place.

The pattern compounds in multi-document summarisation. ArXiv research found up to 75 per cent of generated content in conversation-domain summaries was hallucinated. Asked to summarise information that did not exist in the source documents, GPT-3.5-Turbo fabricated content roughly 79 per cent of the time and GPT-4o around 44 per cent.

Why is the pattern systematic rather than a fix-with-a-better-prompt problem?

Because it is rooted in how the model assigns attention, not in surface prompt design. Language models calculate statistical importance from training data, so the things that appear frequently in similar documents get weighted heavily and the things that appear rarely, conditional clauses, defined-term dependencies, materiality qualifications, get under-weighted. The fix-with-a-better-prompt instinct treats this as a tuning issue when it is actually a property of the architecture.

Northwestern’s Center for Advancing Safety of Machine Intelligence found that even when AI achieves 90 per cent factual accuracy on isolated questions, performance drops on synthesis tasks that require understanding how different pieces of information relate to each other. Multi-document summarisation models proved over-sensitive to input ordering and under-sensitive to input composition. They produced different summaries from the same documents presented in a different sequence.

There is a separate practical problem even good summaries cannot solve. The incompleteness is not visible inside the summary. A well-written summary of an indemnity clause that has dropped the carve-out reads as authoritative because it is authoritative on the parts it kept. The reader has no way to see the gap from the summary alone. Only a direct comparison against the original surfaces what is missing, which is the inverse of the workflow many owners are using when they reach for AI summarisation in the first place.

Where do AI summaries actually earn their place at SME scale?

Three places. Long unstructured material where the summary’s job is orientation, not decision. Multi-document synthesis where you need pattern recognition across a set of similar documents rather than precision on any single one. Fast first-read at zero stakes, internal meeting transcripts, supplier backgrounders, preliminary market reports where a misunderstanding carries no operational cost. In each case the AI summary is a guide to where the signal sits, not the conclusion itself.

The financial-document orientation case has measured support. Research into AI summaries of annual reports found ChatGPT-generated summaries were 70 to 75 per cent shorter than the originals whilst capturing the sentiment signals that predicted subsequent stock-market reactions better than the raw documents did. The summary surfaces which sections warrant a closer read.

The multi-document pattern-finding case has similar evidence. A 2025 ISDA paper on using generative AI to extract clauses across hundreds of credit support annexes found 90 per cent or better accuracy on simpler, standardised clauses when models were given the relevant taxonomy and clause library. Nuanced clauses remained challenging. The routine 80 per cent of the work, identifying which clauses across a portfolio deviate from a baseline, was reliably tractable. The remaining 20 per cent was not.

Where does the summary alone expose you to material risk?

Three categories deserve a different discipline entirely. Legal text, especially contracts where the whole point is to allocate rights and obligations. Financial schedules and quantitative data where context-dependent details determine the meaning. Regulatory documents where conditional logic and cross-references decide whether a requirement applies to you. In each case the structural elements the AI summary tends to drop are precisely the elements that carry the risk.

The legal evidence is stark. Since mid-2023, more than 300 cases of AI-driven legal hallucinations have been documented in court, with at least 200 recorded in the first eight months of 2025 alone. In the Mata v Avianca case, the foundational example, a lawyer submitted citations to cases that did not exist. The court sanctioned the lawyer and confirmed the attorney remained fully responsible for accuracy, regardless of intent. By August 2025 three separate federal courts had sanctioned lawyers for AI-generated hallucinations within a fortnight.

The insurance market has already registered the risk. As of 2026, major insurers are attaching new exclusions to commercial liability policies covering losses arising from generative AI use, including statements made by chatbots and AI-driven decisions in hiring, lending, and pricing. The signal from the underwriters is direct. You cannot outsource verification to AI without accepting liability when the output turns out to be wrong.

What is the proportionate discipline for using AI summaries safely?

Three steps applied in order. Summarise the summary back to yourself or a colleague, so you are explicit about what you think you learned. Name the categories likely to have been dropped, carve-outs, conditions, exceptions, cross-references, materiality qualifications, then check whether they appear in the summary. On anything material, read the original or have a commercially-competent person read it. The cost is usually lower than the cost of skipping it.
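As a rough illustration of step two, the "name the dropped categories" check can be sketched as a simple keyword scan over the summary text. The category names and keyword lists below are illustrative examples, not a legal checklist, and a keyword hit does not prove the original clause was summarised faithfully:

```python
# Illustrative sketch: flag which commonly-dropped contract categories a
# summary never mentions. Keyword lists are examples, not a legal checklist.
CATEGORIES = {
    "carve-outs": ["carve-out", "carve out", "except for", "excluding"],
    "conditions": ["provided that", "subject to", "only if"],
    "exceptions": ["notwithstanding", "exception", "unless"],
    "cross-references": ["clause", "section", "schedule", "as defined in"],
    "renewal terms": ["auto-renew", "renewal", "notice period"],
}

def missing_categories(summary_text: str) -> list[str]:
    """Return the categories with no matching keyword in the summary."""
    text = summary_text.lower()
    return [name for name, keywords in CATEGORIES.items()
            if not any(k in text for k in keywords)]

summary = ("The vendor indemnifies the customer for third-party claims. "
           "Fees are payable quarterly and the agreement runs for two years.")
print(missing_categories(summary))
# → ['carve-outs', 'conditions', 'exceptions', 'cross-references', 'renewal terms']
```

An absence flagged here is only a prompt to go back to the original; it is step three, the human read of the source document, that actually closes the gap.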

For contracts the discipline is firm. Read the summary, then read the entire original yourself, or have someone competent do so. A commercial solicitor takes perhaps fifteen minutes to spot an indemnity carve-out, an auto-renewal clause with a notice requirement, or a liability cap that exposes you to uninsured risk. An AI summary will commonly miss all three.

For financial documents, use the summary to identify sections that need deeper analysis, then conduct that analysis against the original. If a summary says revenue grew 15 per cent year on year, check the claim against the statement itself. If it says the loan carries a 3 per cent rate, verify whether that applies to the whole loan, whether it is fixed or variable, and whether prepayment penalties apply.

For regulatory documents, treat the AI summary as a first draft of a research agenda. Use it to surface which sections apply, then have a regulatory-competent person read those sections in the original. When a regulation says “except as permitted by [other regulation]”, an AI summary will commonly miss the exception. The stakes are compliance violations and enforcement action.

The economics matter at SME scale. Manual contract review by a solicitor at £200 to £400 per hour on a forty-page contract runs to £800 to £3,200. AI-assisted review can cut that by 50 to 75 per cent, but the saving comes because the AI surfaces issues for human judgment. Skip the human layer and you have shifted the cost of errors to after signature.
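The arithmetic above can be made concrete with a back-of-envelope sketch. The hour estimates and rates are illustrative, taken from the ranges quoted in this post:

```python
# Back-of-envelope cost comparison for a forty-page contract review.
# Rates, hours and savings bands are the illustrative ranges quoted above.
def manual_review_cost(hours: float, rate: float) -> float:
    """Solicitor cost at an hourly rate."""
    return hours * rate

def ai_assisted_cost(manual_cost: float, saving: float) -> float:
    """AI-assisted cost given a fractional saving (0.5 to 0.75)."""
    return manual_cost * (1 - saving)

low = manual_review_cost(hours=4, rate=200)    # £800
high = manual_review_cost(hours=8, rate=400)   # £3,200
print(f"Manual: £{low:,.0f} to £{high:,.0f}")
print(f"AI-assisted: £{ai_assisted_cost(low, 0.75):,.0f} "
      f"to £{ai_assisted_cost(high, 0.5):,.0f}")
# → Manual: £800 to £3,200
# → AI-assisted: £200 to £1,600
```

The lower band only holds while the human review layer stays in the loop; remove it and the apparent saving is simply the cost of errors deferred until after signature.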

If you want a sounding board on where AI-summarised work is already shaping decisions in your business, book a conversation.

Sources

- University of St Gallen (2025). New benchmark shows AI understands finance but is often blind when searching for information. Cited for the 91.6% finance comprehension accuracy alongside two-thirds of errors stemming from inability to locate the relevant passage. https://www.unisg.ch/en/newsdetail/news/new-benchmark-shows-ai-understands-finance-but-is-often-blind-when-searching-for-information/
- LegalSifter / Ken Adams (2024). Something else not to use AI for, summarising contracts. Cited for the systematic omissions in AI summaries of contract clauses, dropped cross-references, lost defined terms, and generic language replacing precise language. https://adamscontracts.legalsifter.com/blog/something-else-not-to-use-ai-for-summarizing-contracts
- ArXiv / ACL Findings (2024). How LLMs hallucinate in multi-document summarisation. Cited for up to 75% hallucinated content in conversation-domain summaries and GPT fabrication rates when summarising information not in the source. https://arxiv.org/abs/2410.13961
- MIT Press TACL (2024). Do multi-document summarisation models synthesise. Cited for over-sensitivity to input ordering and under-sensitivity to input composition in multi-document synthesis tasks. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00687/124262/Do-Multi-Document-Summarization-Models-Synthesize
- Amperly (2025). ChatGPT financial statement analysis. Cited for the finding that ChatGPT summaries of annual reports were 70-75% shorter than the originals whilst capturing more relevant insights for sentiment-based market prediction, the orientation use case. https://amperly.com/chatgpt-financial-statement-analysis/
- ISDA (2025). Paper exploring use of generative AI to extract and digitise CSA clauses. Cited for the 90%+ accuracy result on simpler standardised clauses with domain-specific guardrails, alongside the explicit caveat that nuanced clauses remained challenging. https://www.isda.org/2025/05/15/isda-publishes-paper-exploring-use-of-generative-ai-to-extract-and-digitize-csa-clauses/
- Jones Walker (2025). From enhancement to dependency, what the epidemic of AI failures in law means for professionals. Cited for the documented 300+ AI-driven legal hallucination cases since mid-2023 and the Mata v Avianca precedent on professional responsibility. https://www.joneswalker.com/en/insights/blogs/ai-law-blog/from-enhancement-to-dependency-what-the-epidemic-of-ai-failures-in-law-means-for.html
- Lathrop GPM (2026). The AI coverage gap, what new insurance exclusions mean for your business. Cited for the 2026 commercial liability exclusions covering generative AI use, including chatbot statements and AI-related regulatory investigations, the insurance market signal on verification responsibility. https://www.lathropgpm.com/insights/the-ai-coverage-gap-what-new-insurance-exclusions-mean-for-your-business/
- Iris.ai (2024). Extractive vs abstractive summaries and how machines write them. Cited for the technical mechanism, abstractive summarisation generates new text based on statistical patterns in training data rather than retrieving facts, which is the source of the pattern in this post. https://iris.ai/blog/tech-deep-dive-extractive-vs-abstractive-summaries-and-how-machines-write-them
- Northwestern CASMI (2024). The AI summarisation dilemma, when good enough is not enough. Cited for the finding that even at 90% factual accuracy on isolated questions, models perform poorly on synthesis tasks that require understanding how pieces of information relate to one another. https://casmi.northwestern.edu/news/articles/2024/the-ai-summarization-dilemma-when-good-enough-isnt-enough.html

Frequently asked questions

Can I trust an AI summary of a commercial contract before I sign it?

No, not as the basis for signing. AI summarisation reliably drops cross-references between clauses, fails to track how defined terms propagate through the agreement, and replaces precise language with generic language. A summary that says "the vendor indemnifies the customer for third-party claims" can be missing a carve-out that reverses the entire risk allocation. Read the summary to orient yourself, then read the original or have a commercially competent person do so. The cost of fifteen minutes of legal review is lower than the cost of a missed indemnity carve-out, an auto-renewal trap, or an uncapped liability.

Where do AI summaries actually add value at SME scale?

Three places. First, long unstructured material where the summary's job is to orient you to where the signal sits, not to be the decision itself. Second, multi-document synthesis across a set of related documents where you need pattern recognition rather than precision on any single source. Third, fast first-reads at zero stakes: internal meeting transcripts, supplier backgrounders, preliminary market reports. The risk is not getting 90% of the content; it is mistaking a 90% summary for a 100% understanding in a context where precision matters.

What is the verification routine for high-stakes documents?

Three steps. First, summarise the summary back to yourself out loud or to a colleague. This catches obvious omissions and forces you to be explicit about what you think you learned. Second, name the categories of information that were likely dropped: carve-outs, conditions, exceptions, cross-references, materiality qualifications. In contracts these are always present, so if the summary does not mention them, they were probably omitted. Third, on anything material, read the original or have a competent person do so. The discipline is proportionate; the cost of skipping it is usually the kind of surprise you would have preferred to avoid.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
