An owner I spoke with recently said her team produces about twenty AI-assisted outputs a day. Drafts, summaries, client emails, the occasional analysis. She reviews one in twenty, picked by feel, and she suspects she’s getting it wrong in both directions at once.
This is the middle question of evaluating AI output at SME scale. Review everything and the day disappears. Review nothing and errors reach clients. The position that works is structured sampling, which scales review effort to volume without consuming the week. What carries the work is the logic behind the selection, the discipline of recording what you find, and the willingness to adjust the rate when the data starts saying something.
Why review-everything and review-nothing both fail
Reviewing every AI output sounds careful but isn’t sustainable. Twenty outputs a day at five to ten minutes each is two to three hours of verification work daily, on top of everything else. UC Berkeley’s Center for Long-Term Cybersecurity finds that sustained repetitive verification produces fatigue errors: reviewers apply less rigour late in a batch and miss what they would have caught earlier. Reviewing everything ends up reviewing nothing well.
The opposite is worse. The BBC and European Broadcasting Union jointly tested leading AI systems and found 45% of responses contained significant issues. Medical researchers examining ChatGPT-generated references found 47% were fabricated entirely, 46% authentic but inaccurate, only 7% accurate. Even Google’s Gemini 3 Pro still hallucinates in around 0.7% of responses, and many enterprise deployments run above 25%. Twenty AI-assisted outputs daily with zero review will ship between four and a hundred errors a month depending on the system and the task. Structured sampling sits in the gap, accepts that not every output carries equal risk, and puts the effort where the risk is.
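The exposure arithmetic is worth seeing once. A minimal back-of-envelope sketch, assuming roughly twenty working days a month and treating the published error rates as loose bounds rather than precise figures:

```python
# Back-of-envelope exposure: errors shipped per month with zero review.
# The volume and error rates below are illustrative assumptions, not measurements.
outputs_per_day = 20
working_days_per_month = 20
monthly_outputs = outputs_per_day * working_days_per_month  # ~400

for label, error_rate in [("well-tuned system", 0.01),
                          ("mid-range deployment", 0.10),
                          ("poor deployment", 0.25)]:
    expected = monthly_outputs * error_rate
    print(f"{label}: roughly {expected:.0f} errors a month reach clients unchecked")
```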
How to set a defensible starting sample rate
The rate has to be grounded in something measurable; otherwise it’s a guess dressed up as a discipline. Three inputs do the work: volume, stakes and recent error history. Volume sets the budget. Stakes weight it. Error history calibrates the weighting. An owner reviewing “one in twenty by feel” has no basis to know whether five per cent is too high or too low for what they actually produce.
A business producing around twenty AI-assisted outputs a day, a hundred or so a week, allocating roughly two hours weekly to review, can sample ten to twelve outputs at a depth that catches meaningful errors. That puts the starting rate at roughly five to ten per cent. The NIST AI Risk Management Framework adds the second layer, classifying outputs by consequence severity. High-stakes outputs warrant more review than routine internal summaries. Within a two-hour budget you might put five or six reviews on high-stakes outputs and five or six on lower-stakes, which lifts the high-stakes rate to fifteen to twenty per cent and drops the routine rate to two to five per cent. Recent error history is the third anchor. If you’ve logged what intuition-based review found over the past month, that data is your baseline.
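As a sketch, the budgeting arithmetic looks like this; the volumes, budget split and minutes per review are assumptions to replace with your own figures:

```python
# Sketch: turn a weekly review-time budget into per-category sample rates.
# Volumes, budget split and minutes per review are illustrative assumptions.
weekly_minutes_budget = 120                       # ~2 hours of review a week
minutes_per_review = 10                           # depth that catches meaningful errors
reviews_possible = weekly_minutes_budget // minutes_per_review   # ~12 reviews

weekly_volumes = {"high_stakes": 30, "routine": 120}    # outputs produced per week
budget_split = {"high_stakes": 0.5, "routine": 0.5}     # half the budget to each tier

for category, volume in weekly_volumes.items():
    reviews = round(reviews_possible * budget_split[category])
    print(f"{category}: {reviews}/{volume} outputs -> {reviews / volume:.0%} sample rate")
```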
Three sampling logics that work at SME scale
Once the rate is set, the next decision is which outputs actually get reviewed. Three logics work at SME scale without software, and a small business will usually run all three in rotation. The discipline is to pick by logic, not by gut. Gut selection introduces a bias toward what you already suspect and a blind spot around what you don’t.
Random selection per category is the workhorse. Count outputs in each stake category each week, assign a random number to each, sort, take the top N. The ASQ guidance on statistical sampling makes the case: when every item has an equal chance of selection, the sample is reliable for estimating the population. Stratified sampling layers on top when the population isn’t uniform. Two people producing outputs, or two different AI tools being used for different jobs, means error patterns may differ by stratum, and lumping the data together hides them. Three reviews from Person A and seven from Person B if their volumes split that way. Event-triggered review is the third logic: raise the rate temporarily for a specific category when a failure has appeared. Three fabricated stats in a fortnight? Bump proposals to fifteen per cent for the next two weeks, then return to baseline once two cycles run clean.
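A minimal sketch of the first two logics using only Python’s standard library; the category names, volumes and sample sizes are placeholders:

```python
import random

# Random selection per category: every output has an equal chance of being picked.
# Identifiers and volumes here stand in for the week's actual output list.
outputs = {
    "high_stakes": [f"proposal-{i}" for i in range(1, 31)],
    "routine":     [f"summary-{i}" for i in range(1, 121)],
}
sample_sizes = {"high_stakes": 6, "routine": 6}

for category, items in outputs.items():
    print(category, sorted(random.sample(items, sample_sizes[category])))

# Stratified variant: split a category by person or tool and sample each stratum
# in proportion to its volume, so differing error patterns stay visible.
strata = {"person_a": [f"a-{i}" for i in range(1, 31)],
          "person_b": [f"b-{i}" for i in range(1, 71)]}
review_budget = 10
total = sum(len(items) for items in strata.values())
for stratum, items in strata.items():
    n = max(1, round(review_budget * len(items) / total))
    print(stratum, sorted(random.sample(items, n)))
```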
What to record, and how to adjust on the evidence
A review that isn’t recorded is worth less than the time it took. Five fields are enough to make the log useful: date, category, error type if any, severity, and time taken. Severity does the heavy lifting over time; “would have reached the client” is a different signal from “wouldn’t have meaningfully affected the decision”. For an SME this isn’t a system, it’s a one-page shared spreadsheet.
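If the shared spreadsheet lives as a CSV, a minimal sketch of the five-field record might look like this; the file name, category labels and severity wording are illustrative assumptions:

```python
import csv
from datetime import date
from pathlib import Path

# Sketch: append one spot-check result to a five-field log.
# File name, categories and severity wording are illustrative, not prescribed.
LOG_FIELDS = ["date", "category", "error_type", "severity", "minutes_taken"]

def log_review(path, category, error_type, severity, minutes_taken):
    """Record a single review; error_type is empty when nothing was found."""
    is_new = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()          # write the header once, on first use
        writer.writerow({"date": date.today().isoformat(),
                         "category": category,
                         "error_type": error_type,
                         "severity": severity,
                         "minutes_taken": minutes_taken})

log_review("review_log.csv", "proposal", "fabricated statistic",
           "would have reached the client", 12)
```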
The ICO’s guide to AI audits recommends the same discipline at scale: track error rates over time and let trends inform the controls. The log answers the question intuition can’t: is this a fluke or a pattern? A month of weekly entries usually says.
After four to eight weeks of consistent logging the data starts answering whether the rate is right. A sample of forty to eighty items gives reasonably stable estimates of population error rates within ten to fifteen percentage points at 95% confidence. The adjustment rule is simple. Calculate the observed error rate in each category. Higher than your starting assumption, raise the rate by twenty to fifty per cent for the next month, capped so that total weekly review time stays under three hours. Lower than assumed, drop it by ten to thirty per cent. Roughly equal, hold steady. This mirrors statistical process control, which has run in manufacturing for decades on the same principle.
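As a sketch, the rule reduces to one function; the multipliers sit inside the ranges above, while the tolerance, the minutes-per-review figure and the three-hour cap are assumptions to tune:

```python
# Sketch: monthly rate adjustment from the log. Multipliers sit inside the ranges
# above; the tolerance and the 180-minute weekly cap are assumptions to tune.
def adjust_rate(current_rate, assumed_error_rate, observed_error_rate,
                weekly_volume, minutes_per_review,
                tolerance=0.02, max_weekly_minutes=180):
    if observed_error_rate > assumed_error_rate + tolerance:
        new_rate = current_rate * 1.35    # errors worse than assumed: raise 20-50%
    elif observed_error_rate < assumed_error_rate - tolerance:
        new_rate = current_rate * 0.80    # errors better than assumed: drop 10-30%
    else:
        new_rate = current_rate           # roughly equal: hold steady
    # Cap so total weekly review time stays under the three-hour budget.
    time_cap = max_weekly_minutes / (weekly_volume * minutes_per_review)
    return min(new_rate, time_cap)

# Example: high-stakes category assumed at 10% errors, the month's log shows 18%.
print(adjust_rate(0.15, 0.10, 0.18, weekly_volume=30, minutes_per_review=10))  # -> ~0.20
```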
Four situations justify a temporary rate increase regardless of the rolling average. A single high-consequence error: raise the category by fifty per cent for two weeks. An AI tool update: raise temporarily, because new versions sometimes introduce new behaviours, and CIO Magazine’s reporting on agentic systems makes the point that they drift quietly rather than failing suddenly. Reviewed outputs taking longer than usual: that itself signals a change in quality. Feedback from a client flagging something the sampling missed: raise immediately and stay raised until confidence is restored.
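A sketch of those triggers as temporary overrides layered on the baseline rate; only the first multiplier and duration come from the rule above, the rest are assumptions standing in for your own policy:

```python
# Sketch: event-triggered overrides on the baseline rate. The first entry follows
# the rule above (fifty per cent for two weeks); the others are assumed values.
TRIGGER_OVERRIDES = {
    "high_consequence_error": {"multiplier": 1.5, "weeks": 2},
    "ai_tool_update":         {"multiplier": 1.5, "weeks": 2},
    "reviews_running_long":   {"multiplier": 1.5, "weeks": 2},
    "client_flagged_miss":    {"multiplier": 2.0, "weeks": None},  # until confidence is restored
}

def effective_rate(baseline_rate, active_triggers):
    """Apply whichever overrides are currently active to the baseline sample rate."""
    rate = baseline_rate
    for trigger in active_triggers:
        rate *= TRIGGER_OVERRIDES[trigger]["multiplier"]
    return rate

print(effective_rate(0.15, ["ai_tool_update"]))   # 0.225 while the override is active
```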
Related concepts to hold alongside this
Spot-check sampling is one piece of a wider evaluation discipline. The rubric pattern, where outputs are scored blind against a small set of criteria, sits naturally on top: the sample tells you which outputs to review, and the rubric tells you what good looks like. The two-person review threshold is a separate decision driven by stakes. The quarterly reflective audit is the long-loop version of the same idea.
The sampling discipline scales down as well as up. A solo founder using AI for client work can run the same logic at smaller numbers: one or two reviews a week, the same five-field log, the same monthly adjustment. The numbers shift; the discipline doesn’t. Over time, as the team gets better with the tools and the tools themselves improve, error rates drift down and the rate can come down with them. The log becomes the mechanism that tells you whether the decline is real or whether you’re just looking less carefully.
If you want to talk through how this would land in your own setup, book a conversation.



