Using a second AI model to cross-check the first

TL;DR

Running a second AI model as a critic of the first is a practical, low-cost way to catch factual errors, invented citations, and plausible-but-wrong figures before they reach clients or regulators. The gain is real: OpenAI research points to 20 to 40 percent fewer factual errors with this approach. UK regulators including the ICO and FCA expect firms to test AI outputs and document their controls. A two-model SOP is a credible, proportionate response.

Key takeaways

- A second-model check means sending AI output to a second model configured as a critic, with a different prompt designed to find errors rather than generate answers.
- Critique-then-rewrite prompting cuts factual error rates by 20 to 40 percent and reduces hallucinated legal citations from around 69 percent to under 10 percent.
- For UK services firms, ICO data protection guidance, FCA validation expectations, and the EU AI Act's human oversight requirements all point toward documented AI output testing.
- Model diversity matters: using models from different providers reduces the correlated blind spots that appear when two variants of the same model cross-check each other.
- Start with a written SOP identifying which high-consequence AI tasks require the second check, and log what is run before scaling any automation.

In 2023, a New York lawyer called Steven Schwartz filed a court brief citing six case law precedents. Every one was invented. ChatGPT had generated them with apparent confidence, and he submitted the brief without independent verification. In June 2023, the judge sanctioned him and his firm. The Law Society of England & Wales cited the case the following year in its guidance on generative AI, urging solicitors to verify all AI output before use.

The failure pattern Schwartz illustrated is not limited to law. It appears in AI-drafted client summaries where figures do not trace to any source. In compliance notes where a regulation has been subtly paraphrased into something inaccurate. In proposals with projected returns built on numbers the model assembled plausibly but incorrectly.

The practical fix for this pattern has a name: a second-model check. Here is what it is, why it matters for your firm, and when it earns its keep.

What is a second-model check?

A second-model check is the practice of sending AI output to a separate model whose job is to critique the first. The generating model, often called M1, produces the answer. The second model, M2, is given a different prompt: find factual errors, invented citations, missing key points, and numbers that cannot be traced to the source. The two models should ideally come from different providers.

This approach has been standard in high-stakes AI for years. JPMorgan has used multiple models to cross-validate trading and risk signals since at least 2017. Google DeepMind’s 2023 self-consistency research showed that sampling multiple reasoning paths and comparing them improved accuracy on maths and logic benchmarks by up to 17 percentage points compared with a single-pass response. Anthropic uses a similar structure internally, running separate safety models that critique and filter the outputs of its core Claude models.
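The comparison step in self-consistency can be sketched as a majority vote over several sampled answers. This is a minimal Python illustration, not the researchers' code: the function name `majority_answer` and the simple text normalisation are assumptions, and the sampling of answers from a model is left out.

```python
from collections import Counter


def majority_answer(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalisation so "42" and "42 " count as the same answer.
    normalised = [a.strip().lower() for a in answers]
    winner, count = Counter(normalised).most_common(1)[0]
    return winner, count / len(normalised)
```

A low agreement share is a useful signal in its own right: if the samples disagree, the output probably deserves a closer look.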

For a small services firm, the setup is simpler than it sounds. M1 drafts your client summary, compliance note, or financial analysis. M2, configured as a critic, reviews the output and returns a list of issues: location, problem, severity, and suggested fix. If M2 finds nothing material, you have a second opinion. If it flags something, you catch the error before it reaches a client. At current API pricing, running a 2,000-token critique through GPT-4o typically costs well under ten pence.
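As a rough sketch of that handoff, the Python below builds a critic prompt for M2 and parses its reply into a list of issues. The prompt wording, the JSON field names, and the `Issue` class are illustrative assumptions rather than any vendor's API, and the actual call to M2 is deliberately left out.

```python
import json
from dataclasses import dataclass

# Hypothetical critic instructions; adapt the wording to your own SOP.
CRITIC_PROMPT = (
    "You are a critic, not an author. Compare the DRAFT against the SOURCE. "
    "List factual errors, invented citations, missing key points, and numbers "
    "that cannot be traced to the source. Reply with a JSON array of objects "
    "with the keys: location, problem, severity (low, medium or high), fix."
)


@dataclass
class Issue:
    location: str
    problem: str
    severity: str
    fix: str


def build_critic_input(source: str, draft: str) -> str:
    """Combine the critic instructions, the source text and the M1 draft."""
    return f"{CRITIC_PROMPT}\n\nSOURCE:\n{source}\n\nDRAFT:\n{draft}"


def parse_issues(m2_reply: str) -> list[Issue]:
    """Parse M2's JSON reply; raise ValueError if it is not the agreed shape."""
    try:
        rows = json.loads(m2_reply)
        return [
            Issue(
                location=str(row["location"]),
                problem=str(row["problem"]),
                severity=str(row["severity"]).lower(),
                fix=str(row["fix"]),
            )
            for row in rows
        ]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError("M2 reply did not match the agreed JSON shape") from exc


def needs_human_attention(issues: list[Issue]) -> bool:
    """Assumed rule: any medium or high severity issue counts as material."""
    return any(issue.severity in {"medium", "high"} for issue in issues)
```

Failing loudly on a malformed reply is a deliberate choice here: a critic that silently returns "no issues" because its output could not be parsed is worse than no critic.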

Why does this matter for your business?

Generative AI models make systematic errors that a human reviewer can miss because the output reads fluently. A 2024 OpenAI report found that critique-then-rewrite prompting cut factual error rates by 20 to 40 percent. A 2023 Stanford study of legal AI found hallucinated citations in around 69 percent of responses, falling to under 10 percent when outputs were constrained and independently checked.

UK regulatory expectations point in the same direction. The ICO’s 2023 guidance on AI and data protection asks organisations to put human review and independent testing in place for AI outputs that affect individuals. The FCA expects firms using AI in regulated activities to demonstrate appropriate testing and validation of their models. The EU AI Act, which can apply to UK firms serving EU customers, requires human oversight for high-risk AI applications under Article 14 and appropriate accuracy and robustness under Article 15.

None of these frameworks mandate a specific second-model architecture. They do require that you can demonstrate you have tested your outputs, understood their failure modes, and set up controls. A documented two-model workflow is a credible, proportionate way to satisfy that expectation without needing a compliance team to design it.

The NCSC’s 2024 guidelines on secure AI system development add a useful point: running output through a second model need not mean sharing sensitive data with an additional external service. Data minimisation applies to both models in the chain.

Where will you actually use this?

Second-model checks pay off most clearly where an AI error would be difficult to spot before it reaches a client, a regulator, or a decision-maker. For a 5 to 50 person services firm, the three categories that consistently meet that bar are client-facing documents, regulated analysis, and any AI output that an employee will act on without a separate fact-check.

Client-facing documents include summaries, advice notes, proposal sections with financial projections, and anything going to a regulator or counterparty. Regulated analysis covers financial calculations, pricing models, and any AI-generated content involving personal data, employment decisions, or compliance interpretation. The third category, output that employees act on without checking, is the one most easily overlooked. If a staff member asks the AI to summarise a long contract clause and acts on the summary without reading the original, the check never happens at all.

A few sectors have particularly clear use cases. A boutique law firm using AI to draft matter notes has a direct parallel to the Schwartz case. An accounting practice generating client-facing commentary on financial statements carries obvious accuracy risk. An HR consultancy generating employment guidance from a template prompt runs a meaningful compliance exposure if the model mischaracterises a statutory threshold.

GitHub Copilot’s documentation describes a broadly similar multi-tool validation pipeline for generated code: linters, security scanners, and human reviewers operating in sequence. The structure is the same, applied to a different output type.

When is it worth the extra step, and when is it not?

Map your AI use cases on two dimensions: how often does this task run, and how serious is the consequence of an error? High-frequency, high-consequence tasks are where a second-model check earns its cost in time and API spend. Low-frequency tasks or outputs you review thoroughly yourself before use rarely justify the overhead, and adding unnecessary checks creates its own maintenance burden.
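The two-dimension mapping can be written down as a simple rule. This is a sketch under stated assumptions: the threshold of 10 runs per month and the function name `needs_second_check` are placeholders to tune for your own firm, not figures from any guidance.

```python
# Assumed cut-off for "high frequency"; pick a number that fits your workload.
HIGH_FREQUENCY_RUNS_PER_MONTH = 10


def needs_second_check(
    runs_per_month: int,
    high_consequence: bool,
    thoroughly_reviewed_by_human: bool,
) -> bool:
    """Flag tasks that are both frequent and high-consequence, unless a person
    already reviews the output thoroughly before use."""
    if thoroughly_reviewed_by_human:
        return False
    return bool(high_consequence and runs_per_month >= HIGH_FREQUENCY_RUNS_PER_MONTH)


# Hypothetical task inventory: (name, runs per month, high consequence?, reviewed?)
tasks = [
    ("client summary", 40, True, False),
    ("internal meeting notes", 60, False, False),
    ("annual policy review", 1, True, True),
]
flagged = [name for name, runs, high, reviewed in tasks
           if needs_second_check(runs, high, reviewed)]
```

Writing the rule down, even this crudely, makes it easier to explain to staff and to a regulator why some tasks get the second check and others do not.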

There are three failure modes to guard against.

First: using two models from the same provider and assuming you have genuine independence. Anthropic’s 2024 Claude 3 system card notes that models within the same family can repeat certain hallucinations across variants. For safety-critical checks, you want genuine model diversity: for example, pairing GPT-4o with Claude, or a proprietary cloud model with a vetted open-source alternative.

Second: doubling your data exposure. Running the same client brief through two different external services widens your attack surface and can conflict with ICO data minimisation principles. If the underlying text is sensitive, strip personal identifiers before sending it to either model, or keep both models within a single compliant vendor environment.
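A first pass at stripping identifiers can be as simple as a few regular expressions. The patterns below are deliberately crude illustrations (email addresses, UK-style phone numbers, National Insurance numbers) and will miss names, addresses, and many other identifiers, so treat this as a starting sketch rather than a complete anonymisation step.

```python
import re

# Illustrative patterns only; real PII detection needs more than regex.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"(?:\+44\s?|\b0)\d{2,4}[\s-]?\d{3,4}[\s-]?\d{3,4}\b"),
    "NI_NUMBER": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Run the redaction once, before the text goes to M1, and pass the same redacted text to M2 so that neither model sees the original identifiers.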

Third: letting the workflow become so complex it breaks silently. A well-governed single-model process with consistent human sign-off often outperforms a fragile two-model chain that nobody monitors. Start simply: one written SOP, a defined list of which task types require the second check, and a log of what was run. Only once that process is working should you consider wiring it into automation tools.
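The "defined list plus a log" starting point fits in a few lines of Python. The task names, field names, and the `log_check` function are hypothetical; the idea is simply one JSON line per run, written to any file-like object.

```python
import json
from datetime import datetime, timezone

# Hypothetical SOP list: task types that require the second check.
SOP_TASKS_REQUIRING_CHECK = {"client_summary", "compliance_note", "financial_analysis"}


def log_check(log_file, task_type, m1_model, m2_model, issues_found, reviewer):
    """Append one JSON line per run so the process can be audited later."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task_type": task_type,
        "second_check_required": task_type in SOP_TASKS_REQUIRING_CHECK,
        "m1_model": m1_model,
        "m2_model": m2_model,
        "issues_found": issues_found,
        "reviewer": reviewer,
    }
    log_file.write(json.dumps(entry) + "\n")
    return entry
```

In practice you would open a file in append mode and pass it as `log_file`. A plain text log like this is easy to inspect by hand, which matters more at this stage than any automation.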

What else connects to this?

The second-model check is one layer in a broader AI reliability structure. Human-in-the-loop oversight, where a person reviews and approves output before it is used, is required by the EU AI Act for high-risk applications and recommended by the ICO for any AI that affects individuals. Model diversity, using genuinely different providers rather than variants of the same model, is what makes the reliability gain real.

Constitutional AI, the technique Anthropic developed for training Claude, works on related principles: during training, the model critiques and revises its own draft responses against a written set of principles, and AI-generated feedback then guides further training. The UK’s Frontier AI Taskforce uses multiple automated evaluators alongside human experts, treating models as test harnesses for each other. The Bank of England’s challenger model framework applies the same logic to model risk in financial services.

If you are processing personal data in either model, a Data Protection Impact Assessment is likely required under UK GDPR. The ICO’s accountability guidance on DPIAs explains when the threshold is met and what the assessment should cover.

The practical starting point is simple: pick one high-consequence AI task your firm runs regularly, write a one-paragraph SOP stating when the second check is required and who reviews the M2 output before it is used, and run it for four weeks. The error rate and escalation rate you observe will tell you whether to expand it further.
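At the end of the four weeks, the two rates can be computed from whatever records you kept. This sketch assumes each record has a `confirmed_errors` count (issues the human reviewer agreed were real) and an `escalated` flag; both field names and both definitions are assumptions to adapt to your own log.

```python
def pilot_rates(records: list[dict]) -> dict:
    """Summarise a pilot: share of runs with a confirmed error, share escalated."""
    total = len(records)
    if total == 0:
        return {"runs": 0, "error_rate": 0.0, "escalation_rate": 0.0}
    with_errors = sum(1 for r in records if r["confirmed_errors"] > 0)
    escalated = sum(1 for r in records if r["escalated"])
    return {
        "runs": total,
        "error_rate": with_errors / total,
        "escalation_rate": escalated / total,
    }
```

Counting only errors a person confirmed keeps the figure honest: M2 will sometimes flag things that turn out to be fine.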

Sources

- OpenAI (2024). Research on improving factuality in language models. Internal evaluation showing critique-then-rewrite prompting cut factual error rates by 20 to 40 percent in GPT-4 outputs. https://openai.com/research/improving-factuality-language-models
- Wang et al. / Google DeepMind (2023). "Self-consistency improves chain of thought reasoning in language models." arXiv 2203.11171. Sampling multiple reasoning paths improved accuracy on maths and logic benchmarks by up to 17 percentage points. https://arxiv.org/abs/2203.11171
- Choi, J. et al. (2023). "ChatGPT goes to law school." SSRN Working Paper 4576325. Found hallucinated legal citations in approximately 69 percent of responses, falling to under 10 percent when outputs were constrained and independently checked. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4576325
- ICO (2023). "AI and data protection risk mitigation." Sets out the expectation that organisations put human review and independent testing in place for AI outputs affecting individuals. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/ai-and-data-protection-risk-mitigation/
- Bank of England / Prudential Regulation Authority (2018). "Model risk management principles for banks." Establishes challenger models and independent validation as standard practice for reducing model risk in regulated financial services. https://www.bankofengland.co.uk/prudential-regulation/publication/2018/model-risk-management-principles-for-banks
- FCA (2023). Feedback Statement FS23/4 on AI in financial services. States that firms must ensure appropriate testing and validation of AI models and maintain sufficient understanding of how outputs are produced. https://www.fca.org.uk/publication/feedback/fs23-4.pdf
- Cabinet Office (2024). "Artificial Intelligence Playbook for the UK Government." Directs departments to fully test AI products before deployment and maintain thorough assurance with human review routes. https://www.gov.uk/government/publications/ai-playbook-for-the-uk-government/artificial-intelligence-playbook-for-the-uk-government-html
- NCSC (2024). "Guidelines for secure AI system development." Covers data minimisation and strict interface controls when using external AI APIs, including multi-model architectures. https://www.ncsc.gov.uk/collection/guidelines-for-secure-ai-system-development
- Law Society of England & Wales (2024). "Guidance on the use of generative AI." Urges solicitors to independently verify AI outputs and warns against over-reliance on a single tool, citing the Mata v. Avianca case (the Schwartz matter) as a cautionary example. https://www.lawsociety.org.uk/topics/technology/guidance-on-the-use-of-generative-ai

Frequently asked questions

Do I need a different AI provider for the second check?

Different providers reduce the risk of correlated errors, where both models share the same training-data blind spots and confidently agree on the same wrong answer. For low-stakes checks, two instances of the same model add some value. For client-facing or regulated outputs, pairing providers, for example GPT-4o alongside Claude, gives you a genuinely independent verdict.

How much does it cost to run a second-model check?

At current API pricing, re-running a 2,000-token answer through GPT-4o for critique typically costs well under ten pence per check at UK SME volumes. For a typical services firm, the cost of a second check is negligible relative to the cost of correcting an error in a client document or responding to a regulator query.
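The arithmetic behind a per-check cost is straightforward to reproduce. In the sketch below, the token prices and the exchange rate are placeholder assumptions chosen for illustration, not a quote of any provider's current pricing, so substitute the figures from your provider's price list.

```python
def critique_cost_pence(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    gbp_per_usd: float,
) -> float:
    """Estimate the cost of one critique call in pence."""
    usd = (
        input_tokens / 1_000_000 * usd_per_million_input
        + output_tokens / 1_000_000 * usd_per_million_output
    )
    return usd * gbp_per_usd * 100


# Placeholder figures: a 2,000-token answer sent in, a 500-token critique back.
example_cost = critique_cost_pence(
    input_tokens=2_000,
    output_tokens=500,
    usd_per_million_input=2.50,    # assumed price, check your provider
    usd_per_million_output=10.00,  # assumed price, check your provider
    gbp_per_usd=0.80,              # assumed exchange rate
)
```

With these placeholder inputs the result is a fraction of a penny per check. Remember that if the critic also receives the source document, the input token count will be higher than the answer alone.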

Does a second-model check replace human review?

A second-model check improves the probability of catching errors, but UK and EU regulatory frameworks are clear that human sign-off is a separate requirement, not an optional extra. The ICO's AI guidance, the FCA's validation expectations, and the EU AI Act all treat human oversight as mandatory alongside technical controls like second-model checks. The model can flag problems; a person still decides what to do with them.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.
