A founder I work with deployed an AI that drafted customer emails and sent them automatically. Within two weeks the tool sent a price quote with a missing zero, which the customer accepted in writing. The real mistake was not the AI's: a Tier 3 task had been wired up with Tier 1 oversight. Sending a binding price commitment is not the same shape of work as drafting a meeting summary, and the firm had treated them as if they were.
By 2026 the question is rarely automation or human review as a firm-wide policy. It is which oversight model fits the specific task, and the cost of getting it wrong shows up in incident reports more often than in vendor pitches.
The choice you’re facing
Human-in-the-loop (HITL) means a human reviews and approves the AI’s output before it becomes an action. The AI suggests, drafts or recommends; the human decides. The Air Canada chatbot incident is the cautionary anchor: an unsupervised chatbot told a passenger he could apply retroactively for a bereavement fare, and a Canadian tribunal held the airline liable for the chatbot’s advice and ordered it to pay damages. Brand and legal exposure both followed.
Full automation means the AI takes the action without per-instance human approval. The decision is made and executed; a human may monitor dashboards or audit logs after the fact, but no one approves each output. Invoice categorisation, ticket routing and internal-document summarisation are typical examples.
The middle pattern, increasingly common in 2026, is confidence-threshold routing. The AI evaluates its own certainty about each decision. High-confidence outputs proceed autonomously; low-confidence ones escalate to a human. Swiss Life reported 96% routing accuracy on contact-centre tickets using this approach. The pattern works when the confidence scoring itself is well-calibrated. When the model is overconfident on edge cases, risky decisions slip through.
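The escalation rule itself is only a few lines; the work is in choosing and maintaining the threshold. A minimal sketch in Python (the 0.85 threshold, field names and function names are illustrative, not taken from the Swiss Life deployment):

```python
from dataclasses import dataclass

@dataclass
class Decision:
    ticket_id: str
    predicted_queue: str
    confidence: float  # model's self-reported probability for its top choice

CONFIDENCE_THRESHOLD = 0.85  # illustrative; calibrate against real outcomes

def route(decision: Decision) -> str:
    """Return 'auto' to execute autonomously, 'human' to escalate."""
    if decision.confidence >= CONFIDENCE_THRESHOLD:
        return "auto"   # high confidence: proceed without per-instance approval
    return "human"      # low confidence: queue for a reviewer
```

The threshold is a tunable business parameter, not a property of the model: it trades escalation volume against the risk of an overconfident edge case slipping through.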
Three risk dimensions decide which model fits: how easily the decision can be undone (reversibility), how many people are affected if it is wrong (blast radius), and whether it falls under regulatory rules (regulatory exposure). Customer impact, a fourth dimension, often correlates with the first three but is worth checking separately.
When full automation is the right answer
Full automation is the right answer for high-volume tasks with low blast radius, complete reversibility, and no regulatory exposure. The maths is simple: each per-instance approval adds 2-5 minutes of human time. On a thousand-a-month task, that is roughly 33-83 hours of approval work every month. Removing the gate is where the financial case actually lives.
Internal-only tasks usually qualify. An AI that categorises support tickets, drafts first-pass meeting summaries, or proposes expense codes the finance team reconciles at month-end is making decisions that are caught and corrected in the normal flow of work. The cost of an individual error is a minor delay, not a person harmed.
Knowledge-base search and document summarisation also qualify. The AI returns a starting point; the human still consults the source if the answer matters. The output is advisory by design.
Ticket triage works too, with a confidence threshold attached. Routine tickets land in the right queue automatically; borderline cases route to a human. A misroute is reversible and the harm is small.
The common pattern: low cost of individual error, batch detection acceptable, no individual person bearing the consequence of any single mistake. Audit logging is non-negotiable, but per-decision human approval is overhead the case does not require.
When human-in-the-loop is the right answer
HITL is mandatory or strongly recommended when the decision is customer-facing, regulated, irreversible at the moment of execution, or made in a scenario where the model’s performance baseline is unclear.
Customer-facing communication that could be interpreted as advice or commitment belongs in this category. The Air Canada case turned on this point: the chatbot’s statement about bereavement-fare timing was treated by the tribunal as a representation the customer was entitled to rely on. Pricing, policy interpretation, eligibility statements: anything a customer might act on should be reviewed before publication.
Regulated decisions affecting individuals are the legal floor. UK GDPR Article 22 gives individuals the right not to be subject to a solely automated decision producing “legal or similarly significant effects”. The ICO interprets this to cover credit, employment, insurance pricing and access to benefits. The right to human intervention must be real, not a rubber stamp.
The EU AI Act extends this for high-risk systems (Annex III): employment decisions, credit and insurance, education access, law enforcement, migration. Article 14 requires the human performing oversight to be competent, trained and able to override the system, and explicitly warns about automation bias. A glance-and-click approval is not what the regulation has in mind.
Hiring decisions deserve a specific call-out. Amazon abandoned an AI hiring system in 2018 after it learned to penalise CVs containing the word “women’s”. UK employment law expects human involvement in decisions that filter or rank candidates, and ACAS guidance recommends consultation with employees before any AI is deployed in the workplace.
Legal and professional advice rounds out the category. The Nebraska Supreme Court suspended an attorney in 2026 after he submitted a brief with 57 defective citations out of 63, including 20 AI-generated hallucinations. The professional remains accountable for the output.
What it costs to get wrong
Two failure modes, opposite in shape.
Under-supervising a Tier 3 task is the headline-grabbing one. The Air Canada chatbot, the Nebraska lawyer suspension, the Amazon hiring algorithm. Same shape every time: a customer-facing or regulated workflow running unsupervised, an output that goes wrong in a way no one catches until external pressure surfaces it. The cost is direct (damages, fines) and indirect (brand damage, professional indemnity premium increases). PI insurers have started asking explicit questions about AI use, and 2026 policies increasingly carry exclusions for unsupervised AI output.
Over-supervising a Tier 1 task is the quieter failure. A team that puts a human in the loop on every email draft, every ticket categorisation, every invoice code is paying salaried hours to do what the AI was meant to free up. The financial case collapses, approval fatigue sets in, and rubber-stamping creeps in, at which point the firm pays for human review and gets none of the benefit.
Confidence-threshold routing has its own failure mode: the threshold gets set wrong. Set too high and everything escalates. Set too low and risky edge cases slip through. The fix is to calibrate against actual outcomes, not against the model’s self-reported confidence, and recalibrate as the workload shifts.
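Calibrating against actual outcomes can be as simple as bucketing logged decisions by reported confidence and comparing each bucket with how often those decisions turned out correct. A sketch, assuming you log (confidence, was_correct) pairs for every decision (the bucket width and return shape are illustrative):

```python
from collections import defaultdict

def calibration_table(history, bucket_width=0.1):
    """Compare the model's reported confidence with observed accuracy.

    history: iterable of (confidence, was_correct) pairs from past decisions.
    Returns {bucket_floor: (mean_reported_confidence, observed_accuracy, n)}.
    A well-calibrated model shows observed accuracy close to reported
    confidence in every bucket; overconfidence shows up as accuracy
    sitting well below the bucket's confidence level.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in history:
        floor = min(int(confidence / bucket_width) * bucket_width,
                    1.0 - bucket_width)  # clamp confidence == 1.0 into top bucket
        buckets[round(floor, 2)].append((confidence, was_correct))
    table = {}
    for floor, rows in sorted(buckets.items()):
        n = len(rows)
        mean_conf = sum(c for c, _ in rows) / n
        accuracy = sum(1 for _, ok in rows if ok) / n
        table[floor] = (round(mean_conf, 3), round(accuracy, 3), n)
    return table
```

If the 0.8-0.9 bucket is only right 60% of the time, the threshold belongs above 0.9 for that workload, whatever the model reports about itself.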
Audit-trail debt is the silent compounder. A workflow that runs autonomously without recoverable logs cannot be defended when challenged. The ICO, the FCA and any insurer doing post-incident review will ask the same question: which model version made which decision when, and on what input?
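That question dictates the minimum log record. A hypothetical shape for one append-only entry (field names are illustrative, not drawn from any regulator's template):

```python
import json
import datetime

def audit_record(workflow, model_version, input_ref, output,
                 decision_path, approver=None):
    """Build one append-only audit entry answering the post-incident
    question: which model version made which decision, when, on what
    input, and who (if anyone) approved it."""
    return json.dumps({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "workflow": workflow,            # e.g. "ticket-triage"
        "model_version": model_version,  # the exact version, never "latest"
        "input_ref": input_ref,          # pointer to the stored input, not a copy
        "output": output,
        "decision_path": decision_path,  # "auto" or "human-approved"
        "approver": approver,            # None for autonomous decisions
    })
```

Logging a pointer to the input rather than the input itself keeps personal data out of the audit log while still making the decision reconstructable.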
What to ask before you decide
Five questions, in order, for any AI workflow.
One: how reversible is this decision in the moment it is made? An invoice categorised wrongly is fixed at month-end. A binding price quote sent to a customer is not. The reversibility test sets the floor on oversight.
Two: what is the blast radius if it goes wrong on a single instance? A misrouted internal ticket affects one person briefly. A misapplied lending decision affects one person materially. A wrong public policy statement affects every customer who reads it.
Three: does this decision touch UK GDPR Article 22, the EU AI Act high-risk list, FCA model-risk supervision, employment law or professional indemnity insurance? If yes, the regulation often dictates the answer and you do not get to choose. Map this before designing the workflow.
Four: who is your human reviewer, and is their review designed to require a real decision? “Click here to approve” is not enough. The Article 14 standard is competence, training and the ability to override. Build the workflow so the human has to engage with the substance.
Five: what does the audit trail capture, and how long is it retained? Per-decision logging of input, model version, output and (where applicable) approver. This is the artefact that defends the deployment if it is ever challenged.
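Taken together, the first three questions can be sketched as a routing rule in the Tier 1/2/3 framing, with regulation as the floor. The mapping below is an illustrative simplification for triage conversations, not a legal test:

```python
def required_oversight(reversible: bool, blast_radius_low: bool,
                       regulated: bool) -> str:
    """Map the three core risk dimensions onto an oversight model.

    Returns one of:
      "full-automation"      -- log everything, no per-decision approval
      "confidence-threshold" -- auto above threshold, escalate below
      "human-in-the-loop"    -- per-decision human approval required
    """
    if regulated:
        return "human-in-the-loop"     # Article 22 / AI Act territory: no choice
    if reversible and blast_radius_low:
        return "full-automation"       # Tier 1: errors caught in the normal flow
    if reversible or blast_radius_low:
        return "confidence-threshold"  # Tier 2: route edge cases to a human
    return "human-in-the-loop"         # Tier 3: irreversible and wide impact
```

The useful property is the ordering: regulation overrides everything else, and only the fully benign combination earns unsupervised execution.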
The honest answer for the typical UK SME in 2026 is to map every AI workflow to the Tier 1/2/3 framework, match the oversight level, log everything and revisit as the model and the workload evolve. Picking one oversight model for the whole firm is the consistent way to end up with the wrong one for the workflow that matters.



