Picture an owner-operator in a small meeting room, a vendor deck of three AI customer success stories on the table in front of her, two weeks out from signing a six-figure rollout contract. The numbers on the slides are eye-catching. Eighty per cent adoption, two-and-a-half times return on investment in year one, glowing quotes from peer firms in her sector. She has talked to those peers privately. The numbers they describe are not the numbers on the slide. She wants to push back on the deck without sounding hostile to the engagement, and she is not yet sure how.
That gap, between what is in the deck and what a comparable firm should reasonably plan for, is what this post is about. Published AI case studies are real outcomes at real firms. They are also drawn from the top tail of a heavily right-skewed distribution, one where most outcomes cluster low and a few sit far above the rest, and the publication mechanics that put them on the page systematically lift the headline figure above what a comparable customer should expect. Reading them well means separating three things: the existence proof, the median expectation, and the conditions that produced the outcome.
What is an AI case study, really?
An AI case study is a real customer outcome that has been filtered through five publication biases. Survivorship: only successful customers get published. Opt-in reporting: customers who say yes are happier than the average. Vendor selection: among the willing, the vendor picks the cases that flatter the product. Measurement: vendors report what is easy to measure, not what matters most. Timeframe: the window reported is the one that flatters.
None of the five biases requires the vendor to act in bad faith. Each one is a natural consequence of how case studies get commissioned, written, and approved. The customer-side approver wants the firm to look good. The vendor-side marketer wants the product to look good. The case study that survives both filters is the one where both parties are pleased with the outcome.
What this means in practice. A headline figure of three times return on investment with ninety per cent adoption is real for the firm in the case study. It is also drawn from roughly the top decile of comparable customers. The MIT NANDA GenAI Divide research, based on 150 leader interviews and analysis of 300 public deployments, found that around ninety-five per cent of generative AI pilots fail to deliver measurable bottom-line impact. The published case is the survivor of that filtering process, not the typical customer experience.
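The gap between the published figure and the typical one can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the twenty ROI multiples are invented for the example, not taken from any vendor or study, and the function name `published_view` and the 10% cut-off are assumptions chosen to mirror the top-decile point above.

```python
from statistics import median

# Hypothetical first-year ROI multiples for 20 comparable customers.
# Invented for illustration; not data from any vendor or study.
roi_multiples = [0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
                 1.0, 1.1, 1.2, 1.4, 1.5, 1.7, 1.9, 2.2, 2.8, 3.2]


def published_view(outcomes, top_fraction=0.10):
    """Average of the best `top_fraction` of outcomes,
    i.e. the slice a case study is assumed to be drawn from."""
    ranked = sorted(outcomes, reverse=True)
    n_published = max(1, round(len(ranked) * top_fraction))
    top = ranked[:n_published]
    return sum(top) / len(top)


typical = median(roi_multiples)            # what a comparable firm should plan for
headline = published_view(roi_multiples)   # what reaches the slide

print(f"median customer: {typical:.2f}x, published top decile: {headline:.2f}x")
```

With these invented figures the median customer sees a little under one times return while the published slice averages three times. Both numbers are true of the same customer base; only one of them reaches the deck.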
Why does it matter for your business?
Because the existence proof and the median expectation are different claims, and a vendor deck collapses them into the same number. If you read a published case study as evidence that your firm will see the same result, you have made the implicit assumption that your firm sits at the same point in the distribution as the published customer. That assumption is almost always wrong.
Consider the practical consequence. An owner reviewing a vendor deck before signing a rollout contract is making a budget commitment, a change-management commitment, and a reputational commitment, often all at once. If the deck implies three times return on investment in year one and the realistic median is closer to one-and-a-half times with a meaningful tail of underperformers, the difference shows up as a budget overrun, a slower payback, and a tougher internal conversation in month nine. Reading case studies well is what stops the post-rollout review going sideways.
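A rough sketch of that budget gap follows. Everything in it is an assumption made for illustration: the contract cost of 100,000 is hypothetical, "times return" is read as gross first-year return expressed as a multiple of cost, and returns are assumed to accrue evenly across the year, which real rollouts rarely do.

```python
def payback_months(cost, roi_multiple, months=12):
    """Months needed to recover `cost`, assuming returns of
    cost * roi_multiple accrue evenly over `months` (a simplification)."""
    if roi_multiple <= 0:
        return float("inf")  # the spend is never recovered
    monthly_return = cost * roi_multiple / months
    return cost / monthly_return


contract_cost = 100_000  # hypothetical six-figure rollout

deck_payback = payback_months(contract_cost, 3.0)    # figure implied by the deck
median_payback = payback_months(contract_cost, 1.5)  # more realistic planning figure
year_one_shortfall = contract_cost * 3.0 - contract_cost * 1.5

print(f"payback per the deck: {deck_payback:.0f} months")
print(f"payback at the median: {median_payback:.0f} months")
print(f"year-one return below the deck's implied figure: {year_one_shortfall:,.0f}")
```

Under these assumptions the payback point moves from month four to month eight, which is roughly when the tougher internal conversation starts. A firm in the lower quartile would see it move further still.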
The same logic runs in the other direction. A failure story, the type that headlines a business magazine, is also drawn from a tail of the distribution. Klarna’s customer-service reversal is real. It also reflects conditions specific to Klarna: very high call volume, transactional service, and low emotional stakes per call. The right lesson from Klarna is that the early case studies oversold the fit, and that the realistic envelope for AI customer service is narrower than the first wave of narratives suggested.
Where will you actually meet it?
You meet AI case studies in three places, and each one needs the same filter. The vendor pitch deck during a sales conversation. The analyst report citing named rollouts as evidence of a sector trend. The peer conversation with another founder who mentions a firm in your sector that “did this and got that result”. In every case, the figure has passed through one or more of the five biases before it reached you.
Specific examples are worth holding in mind. Allen & Overy’s rollout of Harvey to 3,500 lawyers across 43 offices is an existence proof that a top-tier global law firm can deploy generative AI to its full lawyer base in production. It is not proof that a fifteen-partner regional firm can. The conditions at Allen & Overy (scale, training budget, a central IT function, regulatory engagement capability) are not present at many of the firms reading the case. The same applies to PwC’s expanded Anthropic partnership, to Lloyds’ agentic financial assistant piloted across 7,000 staff, and to BT’s AI Skills Boost programme. Each is a useful existence proof for a class of firm with comparable resources. None is a median expectation for an owner-managed business.
Klarna sits on the other side of the same dynamic. Sebastian Siemiatkowski’s admission that cost-led evaluation produced lower quality is genuinely instructive, particularly for firms considering AI in customer-facing operations. But the lesson is “AI customer service has narrower fit than the first case studies suggested”, not “AI customer service does not work”. Applied without that filter, the Klarna story leads to overcorrection in the opposite direction.
When to ask vs when to ignore
Ask the questions when the case study is being used to justify a budget commitment, a rollout plan, or a benchmark against which your firm will be measured. Ignore the case study, in the sense of stopping at “interesting” rather than drawing a lesson, when it is being used to demonstrate that something is possible at all. The two readings serve different purposes and need different scrutiny.
When you do ask, three questions earn their place in any case-study conversation. First, what conditions were true at this firm that may not be true at yours? Name them: budget, scale, training infrastructure, regulatory context, leadership engagement. Second, what is the median outcome across all of the vendor’s customers, not just the published ones, and what is the lower-quartile figure? Third, what does the abandonment population look like: how many customers walked away inside twelve months, and why? A vendor who cannot answer the second and third questions, or who will not, has told you something useful about the case-study population.
The discipline is symmetric. Apply it to numbers that flatter your sceptical position with the same rigour you apply to numbers that flatter the vendor. The MIT NANDA ninety-five per cent figure is a useful corrective to the top-decile case-study population, but it is itself a particular study with a particular methodology, and treating it as the universal denominator is the mirror image of the vendor-deck error.
Related concepts
Reading AI case studies well sits next to several other reading disciplines on this site. Reading AI vendor case studies sceptically applies the same five-bias frame specifically to vendor decks. The AI ROI benchmark sanity check decomposes the headline ROI ranges that circulate in industry research. The reference-call playbook is the operational companion, what to ask the customers behind the case study.
The rest of the named-cases cluster on this site applies the same filter, one named case at a time. Each post takes a public-record rollout or reversal, names the existence proof, names the conditions that produced it, and names what an owner-managed firm should and should not draw from it. Read in sequence, the cluster is built to make the discipline easier, not harder.
If you would like a second pair of eyes on a vendor deck or a rollout plan before you commit, book a conversation.



