In 2023, a New York lawyer called Steven Schwartz filed a court brief citing six case law precedents. Every one was invented. ChatGPT had generated them with apparent confidence, and he submitted the brief without independent verification. In June 2023, the judge sanctioned him and his firm. The Law Society of England & Wales has since cited the case in its guidance on generative AI, urging solicitors to verify all AI output before use.
The failure pattern Schwartz illustrated is not limited to law. It appears in AI-drafted client summaries where figures do not trace to any source. In compliance notes where a regulation has been subtly paraphrased into something inaccurate. In proposals with projected returns built on numbers the model assembled plausibly but incorrectly.
The practical fix for this pattern has a name: a second-model check. Here is what it is, why it matters for your firm, and when it earns its keep.
What is a second-model check?
A second-model check is the practice of sending AI output to a separate model whose job is to critique the first. The generating model, often called M1, produces the answer. The second model, M2, is given a different prompt: find factual errors, invented citations, missing key points, and numbers that cannot be traced to the source. The two models should ideally come from different providers.
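The hand-off can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation, and it assumes each model is reachable through a plain function that takes a prompt and returns text. The names run_second_model_check, build_critic_prompt, stub_m1, and stub_m2 are hypothetical and do not correspond to any provider's API; in practice the stubs would be replaced by calls to two different providers.

```python
from typing import Callable

# Assumption for illustration: a "model" is any function from prompt text to reply text.
# Real provider SDKs have their own client objects and call signatures.
Model = Callable[[str], str]

CRITIC_INSTRUCTIONS = (
    "You are reviewing a draft written by another AI model. "
    "Using only the source material, list factual errors, invented citations, "
    "missing key points, and numbers that cannot be traced to the source."
)


def build_critic_prompt(source: str, draft: str) -> str:
    """Combine the critic instructions, the source text, and M1's draft."""
    return f"{CRITIC_INSTRUCTIONS}\n\nSOURCE:\n{source}\n\nDRAFT:\n{draft}"


def run_second_model_check(task: str, source: str, m1: Model, m2: Model) -> dict:
    """M1 drafts from the source; M2 critiques that draft against the same source."""
    draft = m1(f"{task}\n\nSOURCE:\n{source}")
    critique = m2(build_critic_prompt(source, draft))
    return {"draft": draft, "critique": critique}


# Stub models so the sketch runs without an API key or network access.
def stub_m1(prompt: str) -> str:
    return "Revenue rose 12% in 2023."


def stub_m2(prompt: str) -> str:
    return "The 12% figure does not appear in the source."


result = run_second_model_check(
    task="Summarise the client's results.",
    source="Revenue rose 8% in 2023.",
    m1=stub_m1,
    m2=stub_m2,
)
print(result["critique"])
```

The point of the design is that M2 never sees M1's instructions, only the source and the draft, so it judges the output on its own terms.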
Related approaches have been used in high-stakes AI for years. JPMorgan has used multiple models to cross-validate trading and risk signals since at least 2017. Google Research’s self-consistency work, presented at ICLR 2023, showed that sampling multiple reasoning paths and taking the most consistent answer improved accuracy on maths and reasoning benchmarks by up to 17 percentage points compared with a single-pass response, although that technique compares samples from one model rather than using a second. Anthropic has described a comparable structure, in which separate classifier models screen the inputs and outputs of its Claude models.
For a small services firm, the setup is simpler than it sounds. M1 drafts your client summary, compliance note, or financial analysis. M2, configured as a critic, reviews the output and returns a list of issues: location, problem, severity, and suggested fix. If M2 finds nothing material, you have a second opinion. If it flags something, you catch the error before it reaches a client. At current API pricing, running a 2,000-token critique through GPT-4o typically costs well under ten pence.
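One way to make the issue list usable is to ask M2 to reply in a fixed structure and parse it. The sketch below assumes M2 has been prompted to return a JSON array with the four fields named above; the field names, the three-level severity scale, and the rule for what counts as material are all assumptions to adapt to your own policy.

```python
import json
from dataclasses import dataclass


@dataclass
class Issue:
    location: str
    problem: str
    severity: str  # assumed scale: "low", "medium", "high"
    suggested_fix: str


# Assumption: medium and high severity issues need a human decision before release.
MATERIAL_SEVERITIES = {"medium", "high"}


def parse_issues(critic_json: str) -> list[Issue]:
    """Parse M2's reply, assumed to be a JSON array of issue objects."""
    return [
        Issue(
            location=item["location"],
            problem=item["problem"],
            severity=item["severity"].lower(),
            suggested_fix=item["suggested_fix"],
        )
        for item in json.loads(critic_json)
    ]


def material_issues(issues: list[Issue]) -> list[Issue]:
    """Keep only the issues severe enough to block release."""
    return [issue for issue in issues if issue.severity in MATERIAL_SEVERITIES]


# Hypothetical M2 reply, written by hand for this example.
example_reply = """[
  {"location": "paragraph 2", "problem": "Growth figure not in source",
   "severity": "High", "suggested_fix": "Use the 8% figure from the accounts"},
  {"location": "paragraph 4", "problem": "Slightly informal tone",
   "severity": "low", "suggested_fix": "Rephrase"}
]"""

issues = parse_issues(example_reply)
needs_review = material_issues(issues)
for issue in needs_review:
    print(f"{issue.location}: {issue.problem} -> {issue.suggested_fix}")
```

A real critic will sometimes return malformed JSON, so a production version would need to handle parse failures, for example by treating them as a flagged run rather than a clean one.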
Why does this matter for your business?
Generative AI models make systematic errors that a human reviewer can miss because the output reads fluently. A 2024 OpenAI report found that critique-then-rewrite prompting cut factual error rates by 20 to 40 percent. A Stanford study of legal AI, published in early 2024, found hallucinations in 69 percent or more of responses to legal queries, falling to under 10 percent when outputs were constrained and independently checked.
UK regulatory expectations point in the same direction. The ICO’s 2023 guidance on AI and data protection asks organisations to put human review and independent testing in place for AI outputs that affect individuals. The FCA expects firms using AI in regulated activities to demonstrate appropriate testing and validation of their models. The EU AI Act, which can apply to UK firms whose AI systems or outputs are used in the EU, requires human oversight of high-risk AI systems under Article 14 and appropriate accuracy, robustness, and cybersecurity under Article 15.
None of these frameworks mandate a specific second-model architecture. They do require that you can demonstrate you have tested your outputs, understood their failure modes, and set up controls. A documented two-model workflow is a credible, proportionate way to satisfy that expectation without needing a compliance team to design it.
The NCSC’s guidelines for secure AI system development, published in November 2023, are a useful reminder here: running output through a second model need not mean sharing sensitive data with an additional external service. Data minimisation applies to both models in the chain.
Where will you actually use this?
Second-model checks pay off most clearly where an AI error would be difficult to spot before it reaches a client, a regulator, or a decision-maker. For a 5 to 50 person services firm, the three categories that consistently meet that bar are client-facing documents, regulated analysis, and any AI output that an employee will act on without a separate fact-check.
Client-facing documents include summaries, advice notes, proposal sections with financial projections, and anything going to a regulator or counterparty. Regulated analysis covers financial calculations, pricing models, and any AI-generated content involving personal data, employment decisions, or compliance interpretation. The third category, output that employees act on without checking, is the one most easily overlooked. If a staff member asks the AI to summarise a long contract clause and acts on the summary without reading the original, the check never happens at all.
A few sectors have particularly clear use cases. A boutique law firm using AI to draft matter notes has a direct parallel to the Schwartz case. An accounting practice generating client-facing commentary on financial statements carries obvious accuracy risk. An HR consultancy generating employment guidance from a template prompt runs a meaningful compliance exposure if the model mischaracterises a statutory threshold.
GitHub Copilot’s documentation describes a broadly similar multi-tool validation pipeline for generated code: linters, security scanners, and human reviewers operating in sequence. The structure is the same, applied to a different output type.
When is it worth the extra step, and when is it not?
Map your AI use cases on two dimensions: how often does this task run, and how serious is the consequence of an error? High-frequency, high-consequence tasks are where a second-model check earns its cost in time and API spend. Low-frequency tasks or outputs you review thoroughly yourself before use rarely justify the overhead, and adding unnecessary checks creates its own maintenance burden.
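The two-dimension mapping can be written down as a simple triage rule. The sketch below is one possible encoding, not a standard: the UseCase fields, the three-level consequence scale, and the threshold of ten runs a month for "high-frequency" are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    runs_per_month: int
    consequence: str  # assumed scale: "low", "medium", "high"
    fully_human_reviewed: bool  # True if someone checks every output against the source


# Assumption: ten or more runs a month counts as high-frequency for a small firm.
FREQUENT_THRESHOLD = 10


def needs_second_check(use_case: UseCase) -> bool:
    """Apply the frequency-by-consequence rule described in the text."""
    if use_case.fully_human_reviewed:
        # A thorough human review already covers the risk; skip the overhead.
        return False
    return (
        use_case.consequence == "high"
        and use_case.runs_per_month >= FREQUENT_THRESHOLD
    )


use_cases = [
    UseCase("client financial commentary", 40, "high", False),
    UseCase("internal meeting notes", 60, "low", False),
    UseCase("annual regulator filing", 1, "high", True),
]

for uc in use_cases:
    decision = "second check" if needs_second_check(uc) else "no second check"
    print(f"{uc.name}: {decision}")
```

Writing the rule down, even this crudely, makes it something you can review and change, rather than a judgement each employee makes differently.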
There are three failure modes to guard against.
First: using two models from the same provider and assuming you have genuine independence. Anthropic’s 2024 Claude 3 model card notes that models within the same family can repeat certain hallucinations across variants. For safety-critical checks, you want genuine model diversity: for example, pairing GPT-4o with Claude, or a proprietary cloud model with a vetted open-source alternative.
Second: doubling your data exposure. Running the same client brief through two different external services widens your attack surface and can conflict with ICO data minimisation principles. If the underlying text is sensitive, strip personal identifiers before sending it to either model, or keep both models within a single compliant vendor environment.
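Stripping identifiers can start with simple pattern matching before the text leaves your systems. The sketch below is a deliberately limited illustration: the three patterns (email addresses, UK National Insurance numbers, and UK-style phone numbers) and the placeholder labels are assumptions, and pattern-based redaction does not catch names, addresses, or context that identifies someone indirectly. It is not a substitute for proper anonymisation.

```python
import re

# Assumed patterns for illustration only. Order matters: emails are replaced first.
PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[NI_NUMBER]", re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")),
    ("[PHONE]", re.compile(r"(?:\+44\s?|\b0)\d{2,4}[\s-]?\d{3,4}[\s-]?\d{3,4}\b")),
]


def strip_identifiers(text: str) -> str:
    """Replace obvious personal identifiers with placeholders before sending text out."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = "Contact Jane at jane.doe@example.co.uk or 020 7946 0958. NI: QQ 12 34 56 C."
redacted = strip_identifiers(sample)
print(redacted)
```

Note that the name "Jane" survives in the output. Names and free-text details need a different approach, such as a maintained list of client names or a dedicated redaction tool, and the same redacted text should be what both M1 and M2 receive.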
Third: letting the workflow become so complex it breaks silently. A well-governed single-model process with consistent human sign-off often outperforms a fragile two-model chain that nobody monitors. Start simply: one written SOP, a defined list of which task types require the second check, and a log of what was run. Only once that process is working should you consider wiring it into automation tools.
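The log of what was run does not need special tooling. A minimal sketch, assuming one JSON record per line in a local file: the field names, the file name, and the choice to store a hash of the draft rather than the draft itself are all illustrative assumptions, made here so that the log does not become another copy of client data.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_log_entry(task_type: str, draft: str, m1_name: str, m2_name: str,
                    issues_found: int, reviewer: str) -> dict:
    """Describe one second-model check without storing the draft text itself."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task_type": task_type,
        "draft_sha256": hashlib.sha256(draft.encode("utf-8")).hexdigest(),
        "m1": m1_name,
        "m2": m2_name,
        "issues_found": issues_found,
        "reviewer": reviewer,
    }


def append_log(path: Path, entry: dict) -> None:
    """Append one entry as a single JSON line."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


entry = build_log_entry(
    task_type="client summary",
    draft="Revenue rose 12% in 2023.",
    m1_name="model-a",  # placeholder names, not real model identifiers
    m2_name="model-b",
    issues_found=1,
    reviewer="A. Reviewer",
)

# A temporary directory keeps the example self-contained; a real log would live
# somewhere backed up and access-controlled.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "second_check_log.jsonl"
    append_log(log_path, entry)
    logged_lines = log_path.read_text(encoding="utf-8").splitlines()

print(logged_lines[0])
```

A gap in this log is also the simplest signal that the workflow has broken silently: if a task type that should be checked weekly has no entries for a month, something has stopped running.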
What else connects to this?
The second-model check is one layer in a broader AI reliability structure. Human-in-the-loop oversight, where a person reviews and approves output before it is used, is required by the EU AI Act for high-risk applications and recommended by the ICO for any AI that affects individuals. Model diversity, using genuinely different providers rather than variants of the same model, is what makes the reliability gain real.
Constitutional AI, the technique Anthropic developed for training Claude, works on related principles, although it operates at training time rather than at the point of use: the model critiques and revises its own draft responses against a written set of principles, and that AI-generated feedback is used to train the final model. The UK’s Frontier AI Taskforce, which has since become the AI Security Institute, uses multiple automated evaluators alongside human experts, treating models as test harnesses for each other. The Bank of England’s challenger model framework applies the same logic to model risk in financial services.
If you are processing personal data in either model, a Data Protection Impact Assessment may be required under UK GDPR, which mandates one where processing is likely to result in a high risk to individuals. The ICO’s accountability guidance on DPIAs explains when the threshold is met and what the assessment should cover.
The practical starting point is simple: pick one high-consequence AI task your firm runs regularly, write a one-paragraph SOP stating when the second check is required and who reviews the M2 output before it is used, and run it for four weeks. The error rate and escalation rate you observe will tell you whether to expand it further.
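The two rates from the pilot are simple proportions. The sketch below assumes one record per run with two yes/no fields, and it uses working definitions that are not fixed by any standard: error rate is the share of runs where the reviewer confirmed at least one real error flagged by M2, and escalation rate is the share of runs passed to a more senior reviewer. The figures in the example are invented to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Run:
    confirmed_error: bool  # reviewer agreed M2 had found a real error
    escalated: bool  # run was passed to a more senior reviewer


def pilot_rates(runs: list[Run]) -> dict:
    """Return the error rate and escalation rate as fractions of all runs."""
    if not runs:
        raise ValueError("No runs recorded, so no rates can be calculated.")
    total = len(runs)
    return {
        "runs": total,
        "error_rate": sum(run.confirmed_error for run in runs) / total,
        "escalation_rate": sum(run.escalated for run in runs) / total,
    }


# Invented pilot data: 20 runs, 3 confirmed errors, 1 of which was escalated.
pilot = (
    [Run(confirmed_error=True, escalated=True)]
    + [Run(confirmed_error=True, escalated=False)] * 2
    + [Run(confirmed_error=False, escalated=False)] * 17
)

rates = pilot_rates(pilot)
print(f"Error rate: {rates['error_rate']:.0%}, escalation rate: {rates['escalation_rate']:.0%}")
```

With only four weeks of data the numbers will be rough, so treat them as a prompt for discussion rather than a precise measurement.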



