How to measure AI output quality without a research project

TL;DR

Output quality is half of AI ROI and almost nobody measures it. A calibrated rubric, blinded comparative assessment, and a stratified sample of around thirty items per side give an SME a defensible quality finding without a research project.

Key takeaways

- Hours-saved without quality is half an ROI calculation. Faster output with 5 percent errors on dimensions you care about is hidden risk that wipes out the savings.
- A useful rubric has three to five dimensions, a five-point scale, and calibrated definitions for each scale point.
- Blinded assessment matters because reviewers score identical work differently when they know the source.
- Sample around thirty items per side, stratified, scored by someone not involved in AI procurement.
- Categorise errors (hallucinations, omissions, misunderstandings) so the firm sees whether they are systematic or random, and whether process redesign can address them.

Picture a managing partner I’ll call Simon. Thirty-lawyer firm, six months into an AI document review tool. The reviews come back faster. The associates are happier. The vendor’s monthly report cites a 30 percent reduction in review time. Simon does not know whether the same level of nuance is being caught on complex factual issues. Nobody has independently checked. The next major client matter is in four weeks. Simon does not yet know what he does not know.

Output quality is half of AI ROI. Most SMEs measure the other half and skip this one. The reasons are familiar. Quality is multidimensional, harder to score than time, and easy to skip when the firm wants the rollout to look good.

Why is hours-saved half of an ROI calculation?

A 30 percent faster process with 5 percent false negatives on materially relevant work has not saved any time once the downstream cost lands. False negatives in a contract review become risk. False negatives in audit testing become compliance exposure. False negatives in clinical documentation become patient safety. Hours saved are real; the cost they create elsewhere is also real, and rarely visible until the consequence arrives.

The firm that reports “hours saved 30 percent” without a quality figure is reporting half a number. The other half is whether the output is good enough on the dimensions that matter. Both numbers belong on the same page.

The right response is to insist on the matching quality figure alongside hours-saved, before drawing any financial conclusion from the time savings.

What does a useful quality rubric look like?

A useful rubric has three components. First, a clear definition of what is being assessed. For document review, this might be “relevance to claim elements” and “completeness of information capture.” For clinical work, “accuracy of classification” and “identification of edge cases.” For written work, “clarity,” “completeness,” “accuracy,” and “tone appropriateness.” Each dimension is named precisely so reviewers apply it consistently.

Second, a five-point scale. 1 is major errors or omissions. 2 is notable defects. 3 is acceptable but with minor gaps. 4 is good, few or no issues. 5 is excellent. Each point on the scale gets a calibrated definition specific to the task, so two assessors looking at the same work arrive at the same score.

Third, a sampling and aggregation protocol. A stratified sample of around thirty items per side gives enough power to detect material differences. Smaller samples are usable, but the conclusion is correspondingly weaker. Calculate the mean quality score for each side and compare. If the AI samples score lower on a dimension the firm cares about, that gap is the cost of the time savings.
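The aggregation step can be sketched in a few lines. Everything here is illustrative: the dimension names, the scale anchors, and the scores are placeholders, not real assessment data.

```python
from statistics import mean

# Hypothetical rubric: dimension name -> calibrated scale anchors (1 to 5).
RUBRIC = {
    "relevance": "1 = irrelevant to claim elements ... 5 = directly on point",
    "completeness": "1 = major omissions ... 5 = nothing material missed",
}

# Illustrative blinded scores: one dict per sampled item, keyed by dimension.
ai_scores = [
    {"relevance": 4, "completeness": 4},
    {"relevance": 5, "completeness": 3},
    {"relevance": 4, "completeness": 4},
]
human_scores = [
    {"relevance": 4, "completeness": 5},
    {"relevance": 5, "completeness": 4},
    {"relevance": 4, "completeness": 4},
]

def dimension_means(scores):
    """Mean score per rubric dimension across all sampled items."""
    return {dim: round(mean(item[dim] for item in scores), 2) for dim in RUBRIC}

print("AI:   ", dimension_means(ai_scores))
print("Human:", dimension_means(human_scores))
```

In practice each side would have around thirty items, stratified across the firm's actual work mix; the comparison logic stays the same.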

Writing the rubric is the work most firms skip. Done once, the rubric is reusable across the lifetime of the deployment.

Why does the assessment need to be blinded?

Multiple studies in quality assessment show that when reviewers know whether work was done by a machine or a human, they systematically score it differently, even when the work is identical. Reviewers may apply stricter standards to AI work because they expect it to be perfect, or more lenient because they are surprised it is competent. Either bias contaminates the result.

Blinding is the standard correction. The assessor does not know which sample is AI-generated and which is human-generated. Ideally the firm uses an external assessor or a peer who is not involved in the AI procurement. For a budget-constrained SME, at minimum the assessor should be someone with no stake in whether the AI proves successful.
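One way to implement the blinding mechanically is to pool both samples, shuffle them, and hand the assessor neutral labels while someone uninvolved holds the unblinding key. The document IDs below are hypothetical; this is a sketch of the protocol, not a prescribed tool.

```python
import random

# Hypothetical document IDs for each side of the comparison.
ai_docs = ["ai-001", "ai-002", "ai-003"]
human_docs = ["hm-001", "hm-002", "hm-003"]

def blind(ai_items, human_items, seed=None):
    """Shuffle both pools together and assign neutral labels.

    Returns the labelled order the assessor sees, plus a key (held by
    someone other than the assessor) for unblinding after scoring.
    """
    pooled = [(doc, "ai") for doc in ai_items] + [(doc, "human") for doc in human_items]
    random.Random(seed).shuffle(pooled)
    labels = [f"item-{i:03d}" for i in range(1, len(pooled) + 1)]
    key = dict(zip(labels, pooled))  # label -> (original ID, source)
    return labels, key

assessor_view, unblinding_key = blind(ai_docs, human_docs, seed=7)
print(assessor_view)  # neutral labels only; no source information
```

The point of the key is separation of duties: the assessor scores "item-004" with no idea which side produced it, and the sources are rejoined to the scores only after the scoring is complete.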

Done properly, a blinded assessment produces a finding the firm can defend. “Our blinded assessor scored thirty AI-reviewed documents and thirty human-reviewed documents on relevance and completeness. Mean scores were 4.1 versus 4.3 on relevance, 3.9 versus 4.4 on completeness.” That is interrogable. “The associates think the quality is fine” is not.

What does the research actually find about AI output quality?

Research from Microsoft Research, Stanford HAI, and MIT CSAIL on AI quality in professional services has been consistent. Quality assessment depends on the dimension being assessed. For tasks like summarisation or classification, AI often performs comparably to humans. For tasks requiring judgment or rare edge cases, AI often underperforms. Where AI operates in a well-trained domain, it sometimes outperforms humans.

The GitHub Copilot study on code generation quality is instructive. Researchers conducted controlled assessments of code produced with and without AI assistance, scored by independent reviewers who did not know the source. Results were mixed. In some tasks, AI-assisted code was qualitatively equivalent to human code. In others, it had subtle defects that were not immediately obvious but caused problems under specific conditions. The broader point is that output quality is multidimensional and context-dependent, not binary.

There is one important user-side finding worth flagging. Microsoft Research and Stanford HAI both find that users systematically overestimate AI output quality. They see plausible output, assume it is correct, and skip validation. Quality problems then surface only when downstream customers or processes encounter the errors at scale, often months after deployment.

The Anthropic and OpenAI evaluation literature reinforces the same point. Evaluation should be multidimensional, defined precisely, and include error analysis. Counting how often the AI is acceptable is not enough. Categorising the errors as hallucinations, omissions, or misunderstandings tells the firm whether they are systematic and addressable through process redesign, or random and requiring ongoing review.
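Tallying error categories needs nothing more than a counter. The error log below is invented for illustration; in a real assessment it would come from the blinded review notes.

```python
from collections import Counter

# Hypothetical error log from the blinded review: one category per error found.
errors = [
    "omission", "hallucination", "omission",
    "misunderstanding", "omission", "hallucination",
]

counts = Counter(errors)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category}: {n} ({n / total:.0%})")
```

A distribution dominated by one category (here, omissions) suggests a systematic weakness that a process change, such as a completeness checklist, might address; a flat distribution suggests random error that needs ongoing sampling instead.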

What sample size and protocol work for an SME?

For most SME deployments, thirty items per side gives enough power to detect a meaningful difference. Stratify the sample across the actual mix of work the firm does, so the assessment reflects reality rather than the easiest cases. The assessor scores both sides against the rubric without knowing the source. The work takes a competent reviewer two to three working days for a single deployment.

The output is a mean score per dimension per side, with the gaps named explicitly. “AI is comparable on relevance, slightly weaker on completeness, weaker on edge-case identification.” That gives the firm three operational decisions to consider. Accept the gap because the time savings outweigh the quality cost. Implement a quality-assurance step (sampling, second-pass review) to close the gap. Or restrict the AI to use cases where the gap does not matter.
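Turning per-dimension gaps into a first-pass verdict can be as simple as a materiality threshold. The mean scores below mirror the worked figures earlier in the post, and the 0.3-point threshold is an assumption each firm would set for itself against its own risk appetite.

```python
# Hypothetical per-dimension mean scores from a blinded assessment.
ai = {"relevance": 4.1, "completeness": 3.9, "edge_cases": 3.2}
human = {"relevance": 4.3, "completeness": 4.4, "edge_cases": 4.5}

# Assumed materiality threshold: gaps under 0.3 points treated as comparable.
THRESHOLD = 0.3

for dim in ai:
    gap = human[dim] - ai[dim]
    verdict = "comparable" if gap < THRESHOLD else "weaker"
    print(f"{dim}: AI {ai[dim]} vs human {human[dim]} -> {verdict} (gap {gap:.1f})")
```

The verdicts, not the raw means, are what drive the three decisions above: accept the gap, add a quality-assurance step, or restrict the use case.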

None of this requires a research-grade study. It requires the discipline to commission the assessment honestly and the willingness to act on the finding when the AI is weaker than the firm assumed.

If your AI deployment is six months in and you do not yet know how its output compares to your existing quality bar, that is the first measurement to commission. If you’d like to talk through how to scope a blinded quality assessment for your specific use case, book a conversation.

Sources

  • Microsoft Research, Stanford HAI, and MIT CSAIL on AI quality in professional services: quality assessment is multidimensional and context-dependent; AI is often comparable on standard metrics for routine tasks, weaker on judgment or rare edge cases, sometimes better on common patterns in well-trained domains. Source.
  • GitHub Copilot quality study on code generation: AI-assisted code is qualitatively equivalent to human code on some dimensions, with subtle defects on others under specific conditions. Source.
  • Microsoft Research and Stanford HAI on user perception: users systematically overestimate AI output quality, see plausible output, assume correctness, and skip validation. Source.
  • Anthropic and OpenAI evaluation literature: evaluation should be multidimensional, defined precisely, and include error analysis (hallucination, omission, misunderstanding). Source.

Frequently asked questions

Why measure AI output quality if I already measure hours saved?

Because hours-saved without quality is half an ROI calculation. Faster output that is 5 percent worse on dimensions the firm cares about is hidden risk, particularly in regulated work. The downstream cost of poor quality (rework, complaints, compliance risk) often exceeds the time savings.

What does a usable AI quality rubric look like at SME scale?

Three to five dimensions defined precisely for the task, scored on a five-point scale with calibrated definitions for each point. Reviewers should not know whether the work was AI-generated or human-generated. Sample around thirty items per side, stratified across the work the firm actually does.

Why does the assessment need to be blinded?

Multiple studies show reviewers score identical work differently when they know whether AI or a human produced it. They apply stricter standards to AI work because they expect it to be perfect, or more lenient because they are surprised it is competent. Blinding removes the confounder.

What do the research findings say about AI output quality?

Microsoft Research, Stanford HAI, and MIT CSAIL studies find AI quality is multidimensional and context-dependent. AI is often comparable on standard metrics for routine tasks, often weaker on judgment or rare edge cases, sometimes better on common patterns. The GitHub Copilot quality study found AI-assisted code qualitatively equivalent on some dimensions, with subtle defects on others. Users systematically overestimate AI quality.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
