She thought the AI had done the contract review for her. The summary was clear, well-structured, and read like a competent lawyer’s note. She signed. Three months later her commercial solicitor spent fifteen minutes on the original and flagged three things the summary had not mentioned. An indemnity carve-out that reversed the risk allocation in the supplier’s favour. An auto-renewal clause with a thirty-day notice window that had already lapsed. A liability cap at three months of fees against a contract that could expose her business to substantially more.
The summary was not wrong, exactly. It was incompletely right in a way that changed the deal. The pattern is common enough now to register in courts and in the insurance market, and it has a structural explanation rooted in how AI summarisation works.
What does AI summarisation reliably keep, drop, and misshape?
It keeps narrative structure and headline claims. It systematically drops cross-references between clauses, defined-term dependencies, materiality qualifications, conditional language, and small technical passages dense with low-frequency vocabulary. It misshapes precise legal terms into generic equivalents: “pending or threatened litigation” becomes “legal issues”; “vendor negligence except for customer misuse outside documented scope” becomes “third-party claims”. The summary reads as complete, which is the operational problem.
Abstractive summarisation is a generative task, not a retrieval task. The model calculates which words are statistically likely to come next, based on patterns in its training data. Headline concepts appear frequently in similar documents, so the model weights them heavily. Carve-outs, exceptions, and cross-references are statistically rare relative to those headline concepts, so they get under-weighted. A University of St Gallen study showed the consequence cleanly. AI analysing financial documents it had not seen during training reached 91.6 per cent accuracy on interpretation, beating human analysts at 82.8 per cent. Two-thirds of its errors came not from misunderstanding the finance but from failing to find the relevant information in the first place.
The pattern compounds in multi-document summarisation. Research published on arXiv found up to 75 per cent of generated content in conversation-domain summaries was hallucinated. Asked to summarise information that did not exist in the source documents, GPT-3.5-Turbo fabricated content roughly 79 per cent of the time and GPT-4o around 44 per cent.
Why is the pattern systematic rather than a fix-with-a-better-prompt problem?
Because it is rooted in how the model assigns attention, not in surface prompt design. Language models calculate statistical importance from training data, so the things that appear frequently in similar documents get weighted heavily, and the things that appear rarely (conditional clauses, defined-term dependencies, materiality qualifications) get under-weighted. The fix-with-a-better-prompt instinct treats this as a tuning issue when it is actually a property of the architecture.
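The under-weighting can be shown with a toy sketch. This is not any real summariser's code, just a naive frequency scorer; the contract sentences are invented for illustration. Sentences built from common, headline vocabulary score high, and the one clause dense with rare terms scores low and gets dropped.

```python
from collections import Counter
import re

def frequency_summary(sentences, keep=2):
    """Toy extractive summariser: keep the sentences whose words are
    most common across the document, mimicking frequency-driven weighting."""
    tokens = [w for s in sentences for w in re.findall(r"[a-z']+", s.lower())]
    freq = Counter(tokens)

    def score(sentence):
        # Average word frequency: headline language scores high,
        # a carve-out full of rare terms scores low.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words)

    return sorted(sentences, key=score, reverse=True)[:keep]

# Hypothetical three-sentence contract: two routine clauses and one carve-out.
contract = [
    "The supplier shall provide the services described in the schedule.",
    "The customer shall pay the fees for the services described in the schedule.",
    "Liability is capped, save for indemnity claims arising under clause 14.2(b).",
]

# The carve-out is the sentence that gets dropped.
print(frequency_summary(contract))
```

Real models are vastly more sophisticated than this, but the directional bias is the same: statistical importance tracks frequency, and the clause that changes the deal is usually the statistical outlier.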
Northwestern’s Center for Advancing Safety of Machine Intelligence found that even when AI achieves 90 per cent factual accuracy on isolated questions, performance drops on synthesis tasks that require understanding how different pieces of information relate to each other. Multi-document summarisation models proved over-sensitive to input ordering and under-sensitive to input composition. They produced different summaries from the same documents presented in a different sequence.
There is a separate practical problem even good summaries cannot solve. The incompleteness is not visible inside the summary. A well-written summary of an indemnity clause that has dropped the carve-out reads as authoritative because it is authoritative on the parts it kept. The reader has no way to see the gap from the summary alone. Only a direct comparison against the original surfaces what is missing, which is the inverse of the workflow many owners are using when they reach for AI summarisation in the first place.
Where do AI summaries actually earn their place at SME scale?
Three places. Long unstructured material where the summary’s job is orientation, not decision. Multi-document synthesis where you need pattern recognition across a set of similar documents rather than precision on any single one. Fast first-read at zero stakes: internal meeting transcripts, supplier backgrounders, and preliminary market reports where a misunderstanding carries no operational cost. In each case the AI summary is a guide to where the signal sits, not the conclusion itself.
The financial-document orientation case has measured support. Research into AI summaries of annual reports found ChatGPT-generated summaries were 70 to 75 per cent shorter than the originals whilst capturing the sentiment signals that predicted subsequent stock-market reactions better than the raw documents did. The summary surfaces which sections warrant a closer read.
The multi-document pattern-finding case has similar evidence. A 2025 ISDA paper on using generative AI to extract clauses across hundreds of credit support annexes found 90 per cent or better accuracy on simpler, standardised clauses when models were given the relevant taxonomy and clause library. Nuanced clauses remained challenging. The routine 80 per cent of the work, identifying which clauses across a portfolio deviate from a baseline, was reliably tractable. The remaining 20 per cent was not.
Where does the summary alone expose you to material risk?
Three categories deserve a different discipline entirely. Legal text, especially contracts where the whole point is to allocate rights and obligations. Financial schedules and quantitative data where context-dependent details determine the meaning. Regulatory documents where conditional logic and cross-references decide whether a requirement applies to you. In each case the structural elements the AI summary tends to drop are precisely the elements that carry the risk.
The legal evidence is stark. Since mid-2023, more than 300 cases of AI-driven legal hallucinations have been documented in court, with at least 200 recorded in the first eight months of 2025 alone. In the Mata v Avianca case, the foundational example, a lawyer submitted citations to cases that did not exist. The court sanctioned the lawyer and confirmed the attorney remained fully responsible for accuracy, regardless of intent. By August 2025 three separate federal courts had sanctioned lawyers for AI-generated hallucinations within a fortnight.
The insurance market has already registered the risk. As of 2026, major insurers are attaching new exclusions to commercial liability policies covering losses arising from generative AI use, including statements made by chatbots and AI-driven decisions in hiring, lending, and pricing. The signal from the underwriters is direct. You cannot outsource verification to AI without accepting liability when the output turns out to be wrong.
What is the proportionate discipline for using AI summaries safely?
Three steps applied in order. Summarise the summary back to yourself or a colleague, so you are explicit about what you think you learned. Name the categories likely to have been dropped (carve-outs, conditions, exceptions, cross-references, materiality qualifications), then check whether they appear in the summary. On anything material, read the original or have a commercially competent person read it. The cost is usually lower than the cost of skipping it.
For contracts the discipline is firm. Read the summary, then read the entire original yourself, or have someone competent do so. A commercial solicitor takes perhaps fifteen minutes to spot an indemnity carve-out, an auto-renewal clause with a notice requirement, or a liability cap that exposes you to uninsured risk. An AI summary will commonly miss all three.
For financial documents, use the summary to identify sections that need deeper analysis, then conduct that analysis against the original. If a summary says revenue grew 15 per cent year on year, check the claim against the statement itself. If it says the loan carries a 3 per cent rate, verify whether that applies to the whole loan, whether it is fixed or variable, and whether prepayment penalties apply.
For regulatory documents, treat the AI summary as a first draft of a research agenda. Use it to surface which sections apply, then have a regulatory-competent person read those sections in the original. When a regulation says “except as permitted by [other regulation]”, an AI summary will commonly miss the exception. The stakes are compliance violations and enforcement action.
The economics matter at SME scale. Manual contract review by a solicitor at £200 to £400 per hour on a forty-page contract runs to £800 to £3,200. AI-assisted review can cut that by 50 to 75 per cent, but the saving comes because the AI surfaces issues for human judgment. Skip the human layer and you have shifted the cost of errors to after signature.
If you want a sounding board on where AI-summarised work is already shaping decisions in your business, book a conversation.



