Picture a managing partner I’ll call Simon. Thirty-lawyer firm, six months into an AI document review tool. The reviews come back faster. The associates are happier. The vendor’s monthly report cites a 30 percent reduction in review time. Simon does not know whether the same level of nuance is being caught on complex factual issues. Nobody has independently checked. The next major client matter is in four weeks. Simon does not yet know what he does not know.
Output quality is half of AI ROI. Most SMEs measure the other half and skip this one. The reasons are familiar. Quality is multidimensional, harder to score than time, and easy to skip when the firm wants the rollout to look good.
Why are hours saved only half of an ROI calculation?
A 30 percent faster process with 5 percent false negatives on materially relevant work has not saved any time once the downstream cost lands. False negatives in a contract review become risk. False negatives in audit testing become compliance exposure. False negatives in clinical documentation become patient safety. Hours saved are real; the cost they create elsewhere is also real, and rarely visible until the consequence arrives.
The firm that reports “hours saved 30 percent” without a quality figure is reporting half a number. The other half is whether the output is good enough on the dimensions that matter. Both numbers belong on the same page.
The right response is to insist on the matching quality figure alongside hours-saved, before drawing any financial conclusion from the time savings.
What does a useful quality rubric look like?
A useful rubric has three components. First, a clear definition of what is being assessed. For document review, this might be “relevance to claim elements” and “completeness of information capture.” For clinical work, “accuracy of classification” and “identification of edge cases.” For written work, “clarity,” “completeness,” “accuracy,” and “tone appropriateness.” Each dimension is named precisely so reviewers apply it consistently.
Second, a five-point scale. 1 is major errors or omissions. 2 is notable defects. 3 is acceptable but with minor gaps. 4 is good, few or no issues. 5 is excellent. Each point on the scale gets a calibrated definition specific to the task, so two assessors looking at the same work arrive at the same score.
Third, a sampling and aggregation protocol. A stratified sample of around thirty items per side gives enough power to detect material differences. Smaller samples are usable, but the conclusion is correspondingly weaker. Calculate the mean quality score for each side and compare. If the AI samples score lower on a dimension the firm cares about, that gap is the cost of the time savings.
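For a firm that wants to see the arithmetic, here is a minimal sketch of the aggregation step in Python, assuming scores have been recorded as (item, source, dimension, score) entries; the records, identifiers, and field names below are hypothetical, not data from any real assessment.

```python
from collections import defaultdict

# Hypothetical blinded scores: (item_id, source, dimension, score on the 1-5 scale).
# "source" is revealed only after all scoring is complete.
scores = [
    ("doc-001", "ai",    "relevance",    4),
    ("doc-001", "ai",    "completeness", 3),
    ("doc-014", "human", "relevance",    5),
    ("doc-014", "human", "completeness", 4),
    # ...roughly thirty items per side in practice
]

# Group scores by (source, dimension) and compute the mean for each group.
by_group = defaultdict(list)
for _, source, dimension, score in scores:
    by_group[(source, dimension)].append(score)

means = {key: sum(vals) / len(vals) for key, vals in by_group.items()}

# Report the gap per dimension: a positive gap means human work scored higher.
for dim in sorted({dim for _, dim in means}):
    ai, human = means.get(("ai", dim)), means.get(("human", dim))
    if ai is not None and human is not None:
        print(f"{dim}: AI {ai:.1f} vs human {human:.1f} (gap {human - ai:+.1f})")
```

The output is exactly the per-dimension comparison described above, and any spreadsheet can do the same job; the point is that the calculation is simple once the blinded scores exist.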
Writing the rubric is the work most firms skip. Done once, the rubric is reusable across the lifetime of the deployment.
Why does the assessment need to be blinded?
Multiple studies in quality assessment show that when reviewers know whether work was done by a machine or a human, they systematically score it differently, even when the work is identical. Reviewers may apply stricter standards to AI work because they expect it to be perfect, or more lenient because they are surprised it is competent. Either bias contaminates the result.
Blinding is the standard correction. The assessor does not know which sample is AI-generated and which is human-generated. Ideally the firm uses an external assessor or a peer who is not involved in the AI procurement. For a budget-constrained SME, at minimum the assessor should be someone with no stake in whether the AI proves successful.
Done properly, a blinded assessment produces a finding the firm can defend. “Our blinded assessor scored thirty AI-reviewed documents and thirty human-reviewed documents on relevance and completeness. Mean scores were 4.1 versus 4.3 on relevance, 3.9 versus 4.4 on completeness.” That is interrogable. “The associates think the quality is fine” is not.
What does the research actually find about AI output quality?
Research from Microsoft Research, Stanford HAI, and MIT CSAIL on AI quality in professional services has been consistent on one point: quality depends on the dimension being assessed. For tasks like summarisation or classification, AI often performs comparably to humans. For tasks requiring judgment or handling rare edge cases, AI often underperforms. Where AI operates in a well-trained domain, it sometimes outperforms humans.
The GitHub Copilot study on code generation quality is instructive. Researchers conducted controlled assessments of code produced with and without AI assistance, scored by independent reviewers who did not know the source. Results were mixed. In some tasks, AI-assisted code was qualitatively equivalent to human code. In others, it had subtle defects that were not immediately obvious but caused problems under specific conditions. The broader point is that output quality is multidimensional and context-dependent, not binary.
There is one important user-side finding worth flagging. Microsoft Research and Stanford HAI both find that users systematically overestimate AI output quality. They see plausible output, assume it is correct, and skip validation. Quality problems then surface only when downstream customers or processes encounter the errors at scale, often months after deployment.
The Anthropic and OpenAI evaluation literature reinforces the same point. Evaluation should be multidimensional, defined precisely, and include error analysis. Counting how often the AI is acceptable is not enough. Categorising the errors as hallucinations, omissions, or misunderstandings tells the firm whether the errors are systematic and addressable through process redesign, or random and requiring ongoing review.
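As an illustration of that error analysis, here is a minimal tally sketch in Python; the labels and counts are invented for the example, not taken from any of the studies above.

```python
from collections import Counter

# Hypothetical error labels assigned by the assessor during the blinded review.
error_labels = [
    "omission", "omission", "hallucination", "omission",
    "misunderstanding", "omission", "hallucination",
]

counts = Counter(error_labels)
total = sum(counts.values())

# A category that dominates the tally points to a systematic failure mode
# (addressable through prompt, workflow, or scope changes); an even spread
# suggests random errors that need an ongoing review step instead.
for category, n in counts.most_common():
    print(f"{category}: {n} ({n / total:.0%} of errors)")
```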
What sample size and protocol work for an SME?
For most SME deployments, thirty items per side gives enough power to detect a meaningful difference. Stratify the sample across the actual mix of work the firm does, so the assessment reflects reality rather than the easiest cases. The assessor scores both sides against the rubric without knowing the source. The work takes a competent reviewer two to three working days for a single deployment.
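For the stratification step, a minimal sketch in Python, assuming the firm knows its rough mix of work types; the categories, proportions, and item pools below are hypothetical.

```python
import random

# Hypothetical mix of work types and their share of the firm's actual caseload.
work_mix = {"commercial contracts": 0.5, "employment": 0.3, "regulatory": 0.2}
SAMPLE_PER_SIDE = 30

# Allocate the thirty-item sample in proportion to the real mix, so common,
# easy work does not crowd out the harder categories.
allocation = {wt: round(share * SAMPLE_PER_SIDE) for wt, share in work_mix.items()}
print(allocation)  # {'commercial contracts': 15, 'employment': 9, 'regulatory': 6}

# Draw that many items at random from each category's pool of completed work.
pools = {wt: [f"{wt}-{i:03d}" for i in range(200)] for wt in work_mix}
sample = [item for wt, n in allocation.items() for item in random.sample(pools[wt], n)]
```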
The output is a mean score per dimension per side, with the gaps named explicitly. “AI is comparable on relevance, slightly weaker on completeness, weaker on edge-case identification.” That leaves the firm with three operational choices. Accept the gap because the time savings outweigh the quality cost. Implement a quality-assurance step (sampling, second-pass review) to close the gap. Or restrict the AI to use cases where the gap does not matter.
None of this requires a research-grade study. It requires the discipline to commission the assessment honestly and the willingness to act on the finding when the AI is weaker than the firm assumed.
If your AI deployment is six months in and you do not yet know how its output compares to your existing quality bar, that is the first measurement to commission. If you’d like to talk through how to scope a blinded quality assessment for your specific use case, book a conversation.



