A founder of a 40-person services firm forwarded me the Klarna story last week. AI was going to replace 700 customer-service agents, then it did not, then they were rehiring. She wanted to know whether to bin her own plan for an AI support layer entirely, or whether the story said something narrower than “AI in customer service does not work”.
It says something narrower. The easy reading is wrong, and it will lead a smaller firm either to over-rotate to AI in pursuit of cost savings, the way Klarna did, or to back away from a tool that would genuinely help. Both are expensive mistakes. The Klarna case is useful, but only if you read what it actually shows.
What did Klarna actually do, and what happened?
In December 2023 Klarna froze all non-engineer hiring on the back of AI deployment. In its first month live the AI handled 2.3 million customer conversations across 35 languages, with reported resolution times 82 per cent faster and 25 per cent fewer repeat inquiries. Through 2024 and into 2025 Klarna reversed course, started rehiring human agents, and CEO Sebastian Siemiatkowski publicly admitted that cost had been over-weighted as an evaluation factor.
The early operating-improvement projection sat at around 40 million USD a year. The deflection metrics held up. The customer experience did not, and that was the half of the equation the cost-led model had not protected.
The arc is real. The deflection metrics are real. The reversal is real. The mistake is reading the reversal as a verdict on AI customer service rather than a verdict on the deployment shape Klarna chose.
Why is the easy reading of the Klarna reversal wrong?
The headline summary, “AI customer service does not work”, treats Klarna as a controlled experiment on a single variable. It was not. Klarna deployed AI uniformly across an inquiry mix that included routine queries (order status, balance checks, account questions) and emotionally complex ones (refunds after a problem, debt-management conversations, dispute escalations, account closures).
The AI handled the routine inquiries well. It degraded the customer experience on the emotionally complex ones, because customers in those moments expect a human to acknowledge what is going on before resolving it. The single-tier deployment was the mistake, not the tool.
A two-tier deployment, with AI on the routine layer and humans on the emotional layer, would not have produced the same reversal. Klarna has effectively built that two-tier model in the hybrid version it now runs, with AI handling roughly two-thirds of inquiries and the rest escalated to humans. That is the conclusion the case actually reaches, and it is the one that ports to a smaller firm thinking about the same deployment.
Where does this discipline meet a UK services firm of 40?
A smaller services firm will not have 2.3 million inquiries a month. The volume difference does not change the segmentation principle; it sharpens it. With smaller inquiry volume, a single badly handled emotional inquiry is a larger share of your monthly customer experience, so segmentation matters more in a 40-person firm than in a fintech, not less.
The practical move is to look at the last 100 client inquiries and sort them by the emotional weight the client brought to the conversation, not by topic label in your CRM. The same topic can sit in either bucket. A refund request after a smooth job is routine. A refund request after a delayed delivery and three missed callbacks is emotionally loaded, and the client expects to be heard before the issue is resolved.
Sort by that, deploy AI to the bottom two-thirds, hold the top third for a human, and you have already designed past the Klarna mistake before you have shortlisted a vendor. The real work is the segmentation, not the software selection. Firms commonly do this in the wrong order, evaluating vendors first and segmenting last, and the rest of the project inherits the cost of that sequencing.
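For a firm that keeps its inquiry log in a spreadsheet or CRM export, the sort described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended tool. The `Inquiry` record, the 1-to-5 `emotional_weight` scale and the function name are all assumptions made for the example, and the score is assumed to come from a person reading each thread, not from the CRM topic label.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Inquiry:
    ref: str
    topic: str             # CRM topic label; deliberately NOT used for routing
    emotional_weight: int  # hypothetical scale: 1 (calm) to 5 (distressed), scored by a person


def split_by_emotional_weight(inquiries, human_one_in=3):
    """Return (ai_tier, human_tier), holding the most loaded third for a human."""
    ranked = sorted(inquiries, key=lambda i: i.emotional_weight, reverse=True)
    # Round up so a borderline inquiry lands with a human rather than the AI.
    cutoff = ceil(len(ranked) / human_one_in)
    return ranked[cutoff:], ranked[:cutoff]


# Illustrative records only: same topic label, very different emotional weight.
sample = [
    Inquiry("A1", "refund", 1),   # refund request after a smooth job
    Inquiry("A2", "refund", 5),   # refund after a delay and three missed callbacks
    Inquiry("A3", "invoice", 2),
]
ai_tier, human_tier = split_by_emotional_weight(sample)
```

The two refund requests carry the same topic label and still land in different tiers, which is the point of sorting by weight. Rounding the cut-off up is a design choice that errs towards the human tier when the split is uneven.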
When should an owner-managed firm say no to an AI customer-service layer?
Say no if your inquiry mix is dominated by emotionally loaded conversations, or if you cannot reliably separate the routine from the complex at the point an inquiry first reaches you. That is not a permanent answer. It says the segmentation work has to come first, and that without it the deployment will inherit the Klarna mistake in miniature.
Say no, too, if cost is your only evaluation factor. The same metric that justified the Klarna hiring freeze, projected operating saving, was the one that drove the reversal eighteen months later. Cost is a legitimate input, but it is rarely the most important one for a customer-service tool, and the reason is the asymmetry of what gets lost when each kind of inquiry goes wrong.
A small saving on routine inquiries handled by AI does not compensate for the client relationship lost when an emotional inquiry is fumbled by the same AI. An owner shopping for the cheapest tool will discover at month four that the price is being paid somewhere they were not looking. The Klarna case names that trap in public, with a real CEO putting it on the record. There is no need for a smaller firm to learn the same lesson the same way.
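The asymmetry can be put as a back-of-envelope break-even: how many lost client relationships would cancel the routine-layer saving? The sketch below assumes a simple linear model, and every figure in it is a placeholder chosen for illustration, not data from Klarna or any real firm. The function and parameter names are hypothetical too.

```python
def breakeven_lost_clients(routine_handled, saving_per_routine, value_of_lost_client):
    """Number of lost client relationships that wipes out the routine-layer saving.

    Assumes a simple linear model: total saving divided by the value of one client.
    """
    total_saving = routine_handled * saving_per_routine
    return total_saving / value_of_lost_client


# Placeholder inputs only; substitute your own firm's numbers.
breakeven = breakeven_lost_clients(
    routine_handled=60,           # routine inquiries a month handled by AI (assumed)
    saving_per_routine=5.0,       # saving per routine inquiry, in pounds (assumed)
    value_of_lost_client=3000.0,  # annual value of one client relationship (assumed)
)
```

With these placeholder inputs the break-even comes out below one client a month, which is the shape of the trap: a single fumbled emotional inquiry can outweigh the whole month's saving. The conclusion depends entirely on the numbers you put in, so treat the model as a prompt for your own figures.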
Related concepts worth holding alongside the Klarna case
The Air Canada bereavement-fare tribunal sits adjacent. British Columbia's Civil Resolution Tribunal held the airline liable for incorrect information its own chatbot gave a grieving passenger, rejecting the argument that the chatbot was a separate legal entity. That case adds a legal-liability dimension Klarna does not, and widens the asymmetry: you are responsible for what your AI says to a customer in your name.
Verdantix and Salesforce research on customer-service deployment patterns through 2024 and 2025 has converged on the same segmentation point Klarna learned in public. AI works for augmentation and for routine inquiries, but it under-performs when it is treated as a replacement for human service across the full mix. The MIT NANDA report’s finding that 95 per cent of generative AI pilots fail to produce measurable bottom-line impact is the population-level frame around Klarna.
The two cases together, Klarna and Air Canada, build a stronger discipline than either does alone. Klarna names the segmentation failure. Air Canada names the accountability failure. A firm sitting between them, evaluating its own AI customer-service plan, has the shape of the discipline it needs without having to invent it.
The brief on the founder’s desk at the start of this post should still go ahead. Just not as Klarna designed it. Segment by emotional weight, hold the top third for a human, and let cost sit alongside quality rather than over-ruling it.



