Amazon’s engineering team spent the better part of three years building a recruiting AI before concluding, in 2017, that they could not make it safe enough to use. The system had been trained on a decade of CVs submitted to the company, most of them from men, reflecting how heavily male Amazon’s engineering applicant pool and workforce were. By 2015, the model was downgrading applications that contained the word “women’s” and penalising graduates of two all-women’s colleges. Engineers patched those specific signals out, but could not guarantee the model would not find new ones. Amazon shelved it.
That story gets told a lot in AI circles. What gets told less often is what it actually means for a 20-person services firm considering whether to add AI to its hiring process. The answer involves data quality, governance, and legal exposure in ways that are directly applicable to any business running automated processes that affect people.
What does an AI project failure actually look like?
Real failures are rarely dramatic. A hiring tool gets shelved after three years because engineers cannot guarantee it is not discriminating. A clinical AI costs $62 million before cancellation. A chatbot gives a passenger the wrong refund policy and ends up in court. The underlying pattern tends to be the same: bad data, a misaligned objective, or a governance gap nobody caught early.
Between 2014 and 2017, Amazon’s AI was technically functional. The problem was what it had learned to optimise for. Trained on historically successful hiring patterns from a predominantly male applicant pool, it treated male-adjacent signals as positive features. The model was doing exactly what it had been asked to do: predict which applications looked like past successful hires. Replicating historical bias was built into the objective from the start.
IBM’s Watson for Oncology followed a different version of the same pattern. MD Anderson Cancer Center began working with IBM in 2013 on a clinical decision-support tool. Four years and roughly $62 million later, the project was cancelled. Internal documents revealed the system had sometimes recommended unsafe treatments, having been trained on limited, often hypothetical, patient data rather than on large real-world datasets. The data problem was fundamental, and it was not identified until years into the engagement.
Why does the failure rate matter for your business?
S&P Global Market Intelligence puts the share of AI projects scrapped between proof of concept and full adoption at around 46 per cent. For a small services firm, a failed project carries proportionally higher cost than it does for an enterprise with dedicated AI teams and deep pockets. Cost, time, and diverted attention all add up faster when your margins are tight.
There are two distinct ways an AI project can fail. The first is quiet abandonment: a proof of concept that looked promising, then ran into data quality problems, integration complexity, or unclear success criteria and never reached production. Gartner estimated that at least 30 per cent of generative AI projects would be abandoned after proof of concept by the end of 2025, citing poor data quality, inadequate risk controls, and unclear business value as primary drivers. This is the common type, and it can be managed if you design for it. The second is more costly: a project that does reach production but causes harm. A chatbot that misrepresents your firm’s policy to clients. A hiring shortlist that inadvertently filters out protected groups. A pricing model that behaves erratically in edge cases.
For a small firm, the second type carries the greater risk. Amazon caught the bias before the tool was used in live decisions. Air Canada’s chatbot was already live when the problem came to light.
Where will your business actually meet these failure patterns?
The most commercially exposed areas in a services business are hiring, client-facing automation, and any decision-making that touches pricing, credit, or compliance. In 2024, British Columbia’s Civil Resolution Tribunal ruled Air Canada liable for its chatbot’s incorrect statements to a passenger, ordering the airline to honour a discount it had never intended to offer. That ruling rested on ordinary liability for misleading a customer, not on specialist AI regulation.
In the UK, the regulatory landscape covering these areas is already active. The ICO’s guidance on AI and data protection requires organisations using AI for decision-making to carry out Data Protection Impact Assessments and to be able to explain automated decisions affecting individuals. Under Article 22 of UK GDPR, individuals have rights in relation to automated decisions that carry legal or significant effects, including the right to human intervention and the right to contest decisions.
The FCA, in its 2022 joint discussion paper with the Bank of England on AI in financial services, warned that models trained on historically skewed data could amplify bias in credit and insurance decisions, potentially breaching obligations under the Equality Act 2010. The NCSC’s machine learning security guidance adds that AI systems can drift over time, producing outputs that degrade without obvious warning signals. Firms that do not monitor for this create operational exposure that may not surface until a client is already affected.
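One way to make drift monitoring concrete is to compare the scores a tool produces today with the scores it produced when you signed it off. The sketch below uses the Population Stability Index, a common drift measure swapped in here for illustration; the NCSC guidance does not prescribe it. The function name `psi`, the five equal-width bins, and the assumption that scores lie between 0 and 1 are all illustrative choices.

```python
import math


def psi(baseline, recent, bins=5):
    """Population Stability Index between two lists of scores in [0, 1].

    Hypothetical helper: 0 means the two distributions match bin for bin;
    larger values mean the recent scores have moved away from the baseline.
    """
    if not baseline or not recent:
        raise ValueError("both score lists must be non-empty")

    def shares(scores):
        counts = [0] * bins
        for s in scores:
            # Equal-width bins; a score of exactly 1.0 goes in the last bin.
            counts[min(int(s * bins), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(scores), 1e-6) for c in counts]

    b, r = shares(baseline), shares(recent)
    return sum((ri - bi) * math.log(ri / bi) for bi, ri in zip(b, r))


# Illustrative data only: scores at sign-off versus scores from last month.
signoff_scores = [0.1, 0.3, 0.5, 0.7, 0.9] * 20
last_month_scores = [0.7, 0.9, 0.9, 0.9, 0.9] * 20
drift = psi(signoff_scores, last_month_scores)
```

A widely used rule of thumb treats a PSI above roughly 0.25 as a shift worth investigating. That threshold is a convention, not a regulatory standard, so treat it as a prompt for a human review and not as a pass or fail.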
When is a cancelled AI project a warning sign, and when is it a rational outcome?
A short, well-scoped pilot that ends without proceeding to production can be a perfectly good outcome, provided you went in with clear learning objectives and exited with something useful. The warning sign is different: a project that ran for months or years without clear success criteria, that went live without governance checks, or that only revealed its problems when a customer or regulator was already involved.
Amazon’s project illustrates the manageable version. Engineers caught the bias in internal testing before the tool was used in live hiring decisions, which meant the company could exit without legal exposure. The reputational cost was real but contained. Air Canada illustrates the other path: a tool that was live and had already produced incorrect information in a commercially binding context when the problem came to light.
The manageable version of a cancelled project starts with a small scope, a time limit, and an exit condition. “We’ll test this tool for three months; if it achieves X, we proceed; if not, we stop” is structurally very different from “we want to use AI to improve our hiring process and see what happens.” The first has a defined endpoint. The second can run indefinitely because there is nothing to measure against.
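The difference between those two framings can be written down in a few lines. This is a minimal sketch, not a governance framework: `PilotPlan`, `decide`, the metric name, and the numbers are all hypothetical, and it assumes a single metric where higher is better.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PilotPlan:
    """Hypothetical record of what was agreed before the pilot started."""
    metric: str      # the single success metric, in plain words
    target: float    # the "X" the tool has to achieve
    end_date: date   # the time limit


def decide(plan, measured, today):
    """Return 'continue' until the time limit, then 'proceed' or 'stop'."""
    if today < plan.end_date:
        return "continue"
    # At the deadline the decision is mechanical: no extensions by default.
    return "proceed" if measured >= plan.target else "stop"


# Illustrative values only.
plan = PilotPlan(
    metric="share of shortlists a reviewer agrees with",
    target=0.90,
    end_date=date(2025, 6, 30),
)
outcome = decide(plan, measured=0.82, today=date(2025, 6, 30))
```

The point of writing it down is that the second framing, “see what happens”, cannot be expressed this way at all: there is no target to compare against and no date on which the comparison is made.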
The diagnostic question for a founder reviewing a current or past AI project is whether it ever had a defined success metric, a plan for testing it against adversarial conditions, and a structured exit if those tests failed. Those three things separate a controlled experiment from a liability.
What to check before you commit to your next AI project
Amazon’s hiring AI ran for three years before engineers concluded the tool could not be made safe enough. IBM’s Watson oncology project cost an estimated $62 million before cancellation. Both failures share a common diagnosis: nobody had asked, at the outset, whether the training data was representative, whether the success metric was clear, or whether the team had a process for detecting harm before it reached people.
The ICO recommends a Data Protection Impact Assessment before deploying AI that makes or influences decisions about individuals. The NCSC advises threat modelling and adversarial testing before deployment, plus ongoing monitoring for model drift. The practical application for a services firm: start with one specific business problem and a single measurable success metric before you choose a tool.
Audit your input data before building anything: where does it come from, does it reflect historical patterns that may be biased, does it cover the full range of situations the tool will encounter? Keep a human in the accountability role for any decision with significant consequences. The model can handle the mechanics; the responsibility for outcomes has to sit with a person. Document what the tool does, what data it uses, and what tests you ran before deploying it.
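A first pass at the data audit does not need specialist tooling. The sketch below, assuming historical records can be reduced to pairs of a group label and an outcome, compares how often each group received the positive outcome in the data you plan to train on. The function names, the sample data, and the grouping are hypothetical.

```python
from collections import Counter


def outcome_rates(records):
    """records: iterable of (group, was_selected) pairs.

    Returns the share of each group that received the positive outcome.
    """
    totals, selected = Counter(), Counter()
    for group, was_selected in records:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {g: selected[g] / totals[g] for g in totals}


def rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means identical rates."""
    top = max(rates.values())
    return min(rates.values()) / top if top else 1.0


# Illustrative data only: 80 records for group A, 20 for group B.
history = (
    [("A", True)] * 40 + [("A", False)] * 40
    + [("B", True)] * 4 + [("B", False)] * 16
)
rates = outcome_rates(history)
ratio = rate_ratio(rates)
```

In this made-up history, group B is both under-represented and selected far less often, so a model trained to imitate it would inherit both problems. A ratio well below 1.0 does not prove discrimination, and no particular cut-off is a legal test in the UK; treat a low number as a reason to look harder before building anything.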
Then test the tool as if someone were actively trying to make it produce the wrong answer. The Air Canada case shows what happens when none of these steps precede launch. The Amazon case shows what happens when they are carried out, but only after the project has already run for three years.
If you’re working out whether a current or planned project is structured to find problems cheaply rather than after the fact, that is a useful conversation to have before something goes live.



