How to read an AI case study without getting lied to

TL;DR

Every AI case study you read is drawn from the top tail of a right-skewed distribution: most outcomes cluster low, and a few sit far above them. Five biases stack to inflate the published numbers: survivorship, opt-in reporting, vendor selection, measurement, and timeframe. A useful read separates three things: the existence proof, the median expectation, and the conditions that produced the outcome. Apply the same filter to vendor decks and to failure stories like Klarna.

Key takeaways

- Published AI case studies are real but unrepresentative. They sit in the top tail of a heavily right-skewed customer-outcome distribution, where most results cluster low.
- Five biases stack to lift the headline number above the median: survivorship, opt-in reporting, vendor selection, measurement, and timeframe.
- Allen and Overy rolling Harvey out to 3,500 lawyers is an existence proof for top-tier firms, not a median expectation for law firms in general.
- Klarna's customer-service reversal narrows the fit for AI customer service; it does not invalidate it. The discipline applies symmetrically to failure stories.
- The three questions to ask of any AI case study: what conditions were true at this firm, what is the median outcome across the vendor's customers, and what does the lower-quartile or abandonment population look like.

Picture an owner-operator in a small meeting room, a vendor deck of three AI customer success stories on the table in front of her, two weeks out from signing a six-figure rollout contract. The numbers on the slides are eye-catching. Eighty per cent adoption, two-and-a-half times return on investment in year one, glowing quotes from peer firms in her sector. She has talked to those peers privately. The numbers they describe are not the numbers on the slide. She wants to push back on the deck without sounding hostile to the engagement, and she is not yet sure how.

That gap, between what is in the deck and what a comparable firm should reasonably plan for, is what this post is about. Published AI case studies are real outcomes at real firms. They are also drawn from the top tail of a heavily right-skewed distribution, one where most outcomes cluster low and a few sit far above them, and the publication mechanics that put them on the page systematically lift the headline figure above what a comparable customer should expect. Reading them well means separating three things: the existence proof, the median expectation, and the conditions that produced the outcome.

What is an AI case study, really?

An AI case study is a real customer outcome that has been filtered through five publication biases. Survivorship: only successful customers get published. Opt-in reporting: customers who say yes are happier than the average. Vendor selection: among the willing, the vendor picks the cases that flatter the product. Measurement: vendors report what is easy to measure, not what matters most. Timeframe: the window reported is the one that flatters.

None of the five biases requires the vendor to act in bad faith. Each one is a natural consequence of how case studies get commissioned, written, and approved. The customer-side approver wants the firm to look good. The vendor-side marketer wants the product to look good. The case study that survives both filters is the one where both parties are pleased with the outcome.

What this means in practice. A headline figure of three times return on investment with ninety per cent adoption is real for the firm in the case study. It is also drawn from roughly the top decile of comparable customers. The MIT NANDA GenAI Divide research, based on 150 leader interviews and analysis of 300 public deployments, found that around ninety-five per cent of generative AI pilots fail to deliver measurable bottom-line impact. The published case is the survivor of that filtering process, not the typical customer experience.
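The stacking effect can be sketched as a toy simulation. Everything here is an assumption made for illustration: the lognormal shape of the outcomes, the break-even cut-off, and the opt-in rule are invented, and only three of the five biases (survivorship, opt-in reporting, vendor selection) are modelled. It is not a reconstruction of any vendor's data or of the MIT NANDA study.

```python
import random
import statistics


def simulate_published_cases(n_customers=1000, n_published=3, seed=0):
    """Toy model of three of the five filters; every number is invented."""
    rng = random.Random(seed)

    # Hypothetical ROI multiples: most customers cluster low, a few sit far above.
    outcomes = [rng.lognormvariate(0.0, 0.75) for _ in range(n_customers)]

    # Survivorship: only customers who at least broke even are candidates.
    survivors = [roi for roi in outcomes if roi >= 1.0]

    # Opt-in reporting: the happier the customer, the likelier they say yes.
    willing = [roi for roi in survivors if rng.random() < min(1.0, roi / 4.0)]

    # Vendor selection: among the willing, publish the most flattering cases.
    published = sorted(willing, reverse=True)[:n_published]
    return outcomes, published


outcomes, published = simulate_published_cases()
print("median across all customers:", round(statistics.median(outcomes), 2))
print("published case studies:", [round(roi, 2) for roi in published])
```

No figure in the published list is fabricated. Each one is a real draw from the simulated population, and each sits well above the median customer.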

Why does it matter for your business?

Because the existence proof and the median expectation are different claims, and a vendor deck collapses them into the same number. If you read a published case study as evidence that your firm will see the same result, you have made the implicit assumption that your firm sits at the same point in the distribution as the published customer. That assumption is almost always wrong.

Consider the practical consequence. An owner reviewing a vendor deck before signing a rollout contract is making a budget commitment, a change-management commitment, and a reputational commitment, often all at once. If the deck implies three times return on investment in year one and the realistic median is closer to one-and-a-half times with a meaningful tail of underperformers, the difference shows up as a budget overrun, a slower payback, and a tougher internal conversation in month nine. Reading case studies well is what stops the post-rollout review going sideways.
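The arithmetic behind that gap is short enough to write down. This is a rough sketch under two stated assumptions: that "N times return on investment" means year-one benefit divided by contract cost, and that the benefit accrues evenly across the twelve months. The 100,000 contract value and the function name `year_one_gap` are hypothetical; the 3.0 and 1.5 multiples are the ones used above.

```python
def year_one_gap(contract_cost, deck_multiple, median_multiple):
    """Compare the deck's implied year-one benefit with a median-case plan.

    Assumes "N times ROI" means year-one benefit divided by contract cost,
    and that benefit accrues evenly across the twelve months.
    """
    deck_benefit = contract_cost * deck_multiple
    median_benefit = contract_cost * median_multiple
    return {
        "benefit_shortfall": deck_benefit - median_benefit,
        "deck_payback_months": 12 / deck_multiple,
        "median_payback_months": 12 / median_multiple,
    }


# Hypothetical six-figure contract; 3.0 and 1.5 are the multiples from the text.
gap = year_one_gap(contract_cost=100_000, deck_multiple=3.0, median_multiple=1.5)
print(gap)
```

Under these assumptions the plan built on the deck is 150,000 short on year-one benefit, and payback slips from month four to month eight.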

The same logic runs in the other direction. A failure story, the type that headlines a business magazine, is also drawn from a tail of the distribution. Klarna’s customer-service reversal is real. It is also a product of conditions specific to Klarna: very high call volume, transactional service, low emotional stakes per call. The right lesson from Klarna is that the early case studies oversold the fit, and the realistic envelope for AI customer service is narrower than the first wave of narratives suggested.

Where will you actually meet it?

You meet AI case studies in three places, and each one needs the same filter. The vendor pitch deck during a sales conversation. The analyst report citing named rollouts as evidence of a sector trend. The peer conversation with another founder who mentions a firm in your sector that “did this and got that result”. In every case, the figure has passed through one or more of the five biases before it reached you.

Specific examples are worth holding in mind. Allen and Overy’s rollout of Harvey to 3,500 lawyers across 43 jurisdictions is an existence proof that a top-tier global law firm can deploy generative AI to its full lawyer base in production. It is not proof that a fifteen-partner regional firm can. The conditions at Allen and Overy (scale, training budget, central IT function, regulatory engagement capability) are not present at many firms reading the case. The same applies to PwC’s expanded Anthropic partnership, to Lloyds’ agentic financial assistant piloted across 7,000 staff, and to BT’s AI Skills Boost programme. Each is a useful existence proof for a class of firm with comparable resources. None is a median expectation for an owner-managed business.

Klarna sits on the other side of the same dynamic. Sebastian Siemiatkowski admitting that cost-led evaluation produced lower quality is genuinely instructive, particularly for firms considering AI in customer-facing operations. But the lesson is “AI customer service has narrower fit than the first case studies suggested”, not “AI customer service does not work”. Applied without that filter, the Klarna story leads to overcorrection in the opposite direction.

When to ask vs when to ignore

Ask the questions when the case study is being used to justify a budget commitment, a rollout plan, or a benchmark against which your firm will be measured. Ignore the case study, in the sense of stopping at “interesting” rather than drawing a lesson, when it is being used to demonstrate that something is possible at all. The two readings serve different purposes and need different scrutiny.

When you do ask, three questions earn their place in any case-study conversation. First, what conditions were true at this firm that may not be true at yours? Name them: budget, scale, training infrastructure, regulatory context, leadership engagement. Second, what is the median outcome across all of the vendor’s customers, not just the published ones, with the lower-quartile figure named? Third, what does the abandonment population look like: how many customers walked away inside twelve months, and why? A vendor who cannot answer the second and third questions, or who will not, has told you something useful about the case-study population.
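If a vendor does share figures for its whole customer base, the second and third questions reduce to a few lines of arithmetic. The sketch below is illustrative only: the `Customer` record, the `summarise` helper, and the eleven ROI figures are all invented, and a real customer list would need an agreed definition of both the ROI multiple and abandonment.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Customer:
    roi_multiple: float  # year-one benefit divided by cost (hypothetical)
    abandoned_within_12m: bool


def summarise(customers):
    """Answer questions two and three for a full customer list."""
    rois = [c.roi_multiple for c in customers]
    # quantiles(..., n=4) returns the three quartile cut points; [0] is the lowest.
    lower_quartile = statistics.quantiles(rois, n=4)[0]
    abandoned = sum(c.abandoned_within_12m for c in customers)
    return {
        "best_case": max(rois),
        "median": statistics.median(rois),
        "lower_quartile": lower_quartile,
        "abandonment_rate": abandoned / len(customers),
    }


# Invented figures for eleven customers; the three weakest walked away.
ROIS = [0.0, 0.3, 0.6, 0.9, 1.1, 1.3, 1.5, 1.8, 2.2, 2.6, 3.0]
customers = [Customer(roi, roi <= 0.6) for roi in ROIS]
summary = summarise(customers)
print(summary)
```

In this invented population the case that would reach the deck is the 3.0, while the median customer sees 1.3, the lower quartile sits at 0.6, and three of the eleven customers abandoned inside the year.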

The discipline is symmetric. Apply it to numbers that flatter your sceptical position with the same rigour you apply to numbers that flatter the vendor. The MIT NANDA ninety-five per cent figure is a useful corrective to the top-decile case-study population, but it is itself a particular study with a particular methodology, and treating it as the universal denominator is the mirror image of the vendor-deck error.

Reading AI case studies well sits next to several other reading disciplines on this site. Reading AI vendor case studies sceptically applies the same five-bias frame specifically to vendor decks. The AI ROI benchmark sanity check decomposes the headline ROI ranges that circulate in industry research. The reference-call playbook is the operational companion: what to ask the customers behind the case study.

The rest of the named-cases cluster on this site applies the same filter, one named case at a time. Each post takes a public-record rollout or reversal, names the existence proof, names the conditions that produced it, and names what an owner-managed firm should and should not draw from it. Read in sequence, the cluster is built to make the discipline easier, not harder.

If you would like a second pair of eyes on a vendor deck or a rollout plan before you commit, book a conversation.

Sources

- Allen and Overy, Harvey rollout (2022 onward). 3,500 lawyers across 43 jurisdictions, one of the earliest enterprise generative-AI deployments in the legal sector, the canonical existence proof for top-tier firm rollout. https://www.harvey.ai/customer-stories/allen-overy
- Slaughter and May, firmwide adoption of Harvey. Comprehensive rollout across all practice areas, illustrating maturation from pilot to integrated workflow at a top-tier firm. https://www.harvey.ai/blog/slaughter-and-may-adopts-harvey-firmwide
- PwC and Anthropic expanded partnership (2025). Claude across PwC's global workforce focusing on agentic build, AI-native deal-making, and enterprise function reinvention, an existence proof for Big Four scale. https://www.anthropic.com/news/pwc-expanded-partnership
- Lloyds Banking Group, agentic AI financial assistant. Controlled staff pilot of 7,000 employees before broader rollout, illustrating phased implementation in a regulated sector. https://www.lloydsbankinggroup.com/media/press-releases/2026/lloyds-banking-group/ai-driven-benefits-2026.html
- BT Group, UK AI Skills Boost programme founding partner. Comprehensive training framework for AI literacy across staff, illustrating the human-side investment that the technical case studies routinely omit. https://newsroom.bt.com/upskilling-business-for-an-ai-future/
- Klarna, AI customer-service reversal (2024-2025). Sebastian Siemiatkowski's admission that cost-led evaluation produced lower quality, hybrid model now in place with humans handling nuanced inquiries, the canonical failure-narrative correction. https://www.emarketer.com/content/klarna-backtracks-ai-customer-service-plans
- MIT NANDA initiative, The GenAI Divide report (2025). 95 per cent of generative AI pilots fail to deliver measurable bottom-line impact, based on 150 leader interviews, 350-employee survey, and analysis of 300 public deployments, the median-distribution evidence behind the case-study skew. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
- Survivorship bias, foundational concept. Pattern across vendor case studies, fund track records, and restaurant guidebooks where published successes systematically exclude abandoners and underperformers. https://en.wikipedia.org/wiki/Survivorship_bias
- Orgvue (2025). 92 per cent of organisations have invested in AI but 78 per cent say projects have stalled or failed, corroborating evidence on the left tail of the outcome distribution. https://www.orgvue.com/news/92-of-organizations-have-invested-in-ai-but-78-say-projects-have-either-stalled-or-failed/
- Harvard Business Review (2025). Most AI initiatives fail, a five-part framework to address the gap, named-author research on why headline rollouts skew above the median enterprise outcome. https://hbr.org/2025/11/most-ai-initiatives-fail-this-5-part-framework-can-help

Frequently asked questions

Why do published AI case studies systematically overstate what your firm should expect?

Five biases stack. Survivorship: only customers who succeeded get published. Opt-in reporting: customers who agree to participate are happier than the average. Vendor selection: among the willing, the vendor picks the most flattering cases. Measurement: vendors report what is easy to measure, not what matters most. Timeframe: vendors report over the window that flatters. None of this means the numbers are fabricated. It means they are drawn from the top tail of the distribution.

What three questions should I ask of any AI case study in a vendor deck?

First, what conditions were true at this firm that may not be true at yours: budget, scale, training infrastructure, regulatory context? Second, what is the median outcome across all the vendor's customers, not just the published ones, with the lower-quartile figure named? Third, what does the abandonment population look like: how many customers walked away inside twelve months, and why? A vendor who cannot or will not answer has told you something useful.

Does this discipline mean I should ignore AI case studies altogether?

No. Case studies are useful as existence proofs. Allen and Overy with Harvey proves a top-tier law firm can deploy generative AI to 3,500 lawyers in production. Klarna's reversal proves that AI customer service has a narrower fit than the early case studies suggested. Both are useful. The mistake is reading the existence proof as a median expectation for a comparable firm. They are different claims and they need different evidence.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
