Spot-check sampling for AI output, the SME approach

TL;DR

Spot-check sampling is the middle position between reviewing every AI output and reviewing none. It works at SME scale when the sample rate is tied to volume, stakes and recent error history, when selection is random or stratified rather than gut feel, and when findings are logged so the rate can be adjusted on evidence rather than mood.

Key takeaways

- Review-everything and review-nothing both fail at SME scale; structured sampling is the sustainable middle.
- A defensible starting rate is grounded in output volume, stake category and recent error history, not intuition.
- Three sampling logics work at SME scale: random per category, stratified by stakes or producer, and event-triggered when a specific failure has appeared.
- A one-page weekly log of date, category, error type, severity and review time turns one-off checks into a feedback signal.
- Adjust rates monthly on the evidence: raise them when errors cluster or tools change, lower them when a category goes clean for a month.

An owner I spoke with recently said her team produces about twenty AI-assisted outputs a day. Drafts, summaries, client emails, the occasional analysis. She reviews one in twenty, picked by feel, and she suspects she’s getting it wrong in both directions at once.

This is the middle question of evaluating AI output at SME scale. Review everything and the day disappears. Review nothing and errors reach clients. The position that works is structured sampling, which scales review effort to volume without consuming the week. What carries the work is the logic behind the selection, the discipline of recording what you find, and the willingness to adjust the rate when the data starts saying something.

Why review-everything and review-nothing both fail

Reviewing every AI output sounds careful but isn’t sustainable. Twenty outputs a day at five to ten minutes each is two to three hours of verification work daily, on top of everything else. UC Berkeley’s Center for Long-Term Cybersecurity finds that sustained repetitive verification produces fatigue errors: reviewers apply less rigour late in a batch and miss what they would have caught earlier. Reviewing everything ends up reviewing nothing well.

The opposite is worse. The BBC and European Broadcasting Union jointly tested leading AI systems and found 45% of responses contained significant issues. Medical researchers examining ChatGPT-generated references found 47% were fabricated entirely, 46% authentic but inaccurate, only 7% accurate. Even Google’s Gemini 3 Pro still hallucinates in around 0.7% of responses, and many enterprise deployments run above 25%. Twenty AI-assisted outputs daily with zero review will ship between four and a hundred errors a month depending on the system and the task. Structured sampling sits in the gap, accepts that not every output carries equal risk, and puts the effort where the risk is.
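The four-to-a-hundred range above is simple arithmetic: outputs per day, times working days, times the system's error rate. A minimal sketch, where the function name and the 21-working-day month are assumptions for illustration:

```python
# Back-of-envelope estimate of errors shipped per month with zero review.
def monthly_errors(outputs_per_day, error_rate, working_days=21):
    return outputs_per_day * working_days * error_rate

low = monthly_errors(20, 0.01)    # ~4 a month at a strong model's ~1% rate
high = monthly_errors(20, 0.25)   # ~105 a month at a weaker deployment's ~25% rate
```

The spread between those two numbers is exactly why the error-history input matters: the right sample rate depends on which end of the range your tools actually sit.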

How to set a defensible starting sample rate

The rate has to be grounded in something measurable, otherwise it’s a guess dressed up as a discipline. Three inputs do the work: volume, stakes and recent error history. Volume sets the budget. Stakes weight it. Error history calibrates the weighting. An owner reviewing “one in twenty by feel” has no basis to know whether five per cent is too high or too low for what they actually produce.

A business producing around twenty AI-assisted outputs a day, a hundred or more in a typical week, allocating roughly two hours weekly to review, can sample ten to twelve outputs at a depth that catches meaningful errors. That puts the starting rate at five to ten per cent. The NIST AI Risk Management Framework adds the second layer, classifying outputs by consequence severity. High-stakes outputs warrant more review than routine internal summaries. Within a two-hour budget you might put five or six reviews on high-stakes outputs and five or six on lower-stakes, which lifts the high-stakes rate to fifteen to twenty per cent and drops the routine rate to two to five per cent. Recent error history is the third anchor. If you’ve logged what intuition-based review found over the past month, that data is your baseline.
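The budget-to-rate arithmetic can be sketched in a few lines. The function name, the ten-minute average review and the even split between tiers are assumptions for illustration, not prescriptions:

```python
# Convert a weekly review budget into per-tier sample rates.
def starting_rates(high_volume, low_volume, budget_minutes=120,
                   minutes_per_review=10, high_share=0.5):
    slots = budget_minutes // minutes_per_review      # total reviews per week
    high_slots = round(slots * high_share)            # reviews spent on high stakes
    low_slots = slots - high_slots                    # remainder on routine outputs
    return {
        "high_rate": min(1.0, high_slots / high_volume),
        "low_rate": min(1.0, low_slots / low_volume),
    }

# e.g. 30 high-stakes and 120 routine outputs a week:
rates = starting_rates(30, 120)   # high-stakes ~20%, routine ~5%
```

Plugging in your own weekly volumes is the whole exercise; the point is that the rate falls out of the budget rather than being picked first.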

Three sampling logics that work at SME scale

Once the rate is set, the next decision is which outputs actually get reviewed. Three logics work at SME scale without software, and a small business will usually run all three in rotation. The discipline is to pick by logic, not by gut. Gut selection introduces a bias toward what you already suspect and a blind spot around what you don’t.

Random selection per category is the workhorse. Count outputs in each stake category each week, assign a random number to each, sort, take the top N. The ASQ guidance on statistical sampling makes the case: when every item has an equal chance of selection, the sample is reliable for estimating the population. Stratified sampling layers on when the population isn’t uniform. Two people producing outputs, or two different AI tools being used for different jobs, means error patterns may differ by stratum and lumping the data hides them. Three reviews from Person A and seven from Person B if their volumes split that way. Event-triggered review is the third logic: raise the rate temporarily for a specific category when a failure has appeared. Three fabricated stats in a fortnight? Bump proposals to fifteen per cent for the next two weeks, then return to baseline once two cycles run clean.
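The first two logics fit in a short script, assuming each output is a dict with "id", "category" and "producer" keys (the field names are illustrative):

```python
import random

def random_per_category(outputs, category, n, seed=None):
    """Equal-chance selection within one stake category."""
    rng = random.Random(seed)
    pool = [o for o in outputs if o["category"] == category]
    return rng.sample(pool, min(n, len(pool)))

def stratified(outputs, n, key="producer", seed=None):
    """Spread n reviews across strata in rough proportion to volume."""
    rng = random.Random(seed)
    strata = {}
    for o in outputs:
        strata.setdefault(o[key], []).append(o)
    picked = []
    for pool in strata.values():
        share = max(1, round(n * len(pool) / len(outputs)))
        picked.extend(rng.sample(pool, min(share, len(pool))))
    return picked
```

Event-triggered review needs no extra machinery: it is just `random_per_category` called with a temporarily larger `n` for the category under suspicion.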

What to record, and how to adjust on the evidence

A review that isn’t recorded is worth less than the time it took. Five fields are enough to make the log useful: date, category, error type if any, severity, and time taken. Severity does the heavy lifting over time; “would have reached the client” is a different signal from “wouldn’t have meaningfully affected the decision”. For an SME this isn’t a system, it’s a one-page shared spreadsheet.
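If the spreadsheet lives as a CSV file, the whole log is one helper. The file name and function are assumptions; a shared spreadsheet with the same five columns does the identical job:

```python
import csv
import os
from datetime import date

FIELDS = ["date", "category", "error_type", "severity", "review_minutes"]

def log_review(path, category, error_type, severity, review_minutes):
    # Write the header only the first time the file is created.
    first_write = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if first_write:
            writer.writeheader()
        writer.writerow({
            "date": date.today().isoformat(),
            "category": category,
            "error_type": error_type,      # "" when the output was clean
            "severity": severity,          # e.g. "client-facing" vs "no impact"
            "review_minutes": review_minutes,
        })
```

A month of rows in this shape is enough to compute per-category error rates for the adjustment step below.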

The ICO’s guide to AI audits recommends the same discipline at scale, track error rates over time and let trends inform the controls. The log answers the question intuition can’t, is this a fluke or a pattern, and a month of weekly entries usually says.

After four to eight weeks of consistent logging the data starts answering whether the rate is right. A sample of forty to eighty items gives reasonably stable estimates of population error rates within ten to fifteen percentage points at 95% confidence. The adjustment rule is simple. Calculate the observed error rate in each category. Higher than your starting assumption, raise the rate by twenty to fifty per cent for the next month, capped so total weekly review time stays under three hours. Lower than assumed, drop it by ten to thirty per cent. Roughly equal, hold steady. This mirrors statistical process control, which has run in manufacturing for decades on the same principle.
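The monthly rule can be written as one function. The two-percentage-point tolerance band and the midpoint multipliers (×1.35 up, ×0.8 down, sitting inside the 20–50% and 10–30% ranges above) are assumptions for illustration:

```python
def adjust_rate(current_rate, assumed_error_rate, observed_error_rate,
                tolerance=0.02):
    if observed_error_rate > assumed_error_rate + tolerance:
        return min(1.0, current_rate * 1.35)   # errors clustering: raise
    if observed_error_rate < assumed_error_rate - tolerance:
        return current_rate * 0.8              # category running clean: lower
    return current_rate                        # roughly as assumed: hold

adjust_rate(0.10, 0.05, 0.15)   # raises the rate for next month
adjust_rate(0.10, 0.05, 0.01)   # lowers it
```

Running this once a month per category, against the error rates the log produces, is the entire feedback loop.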

Four situations justify a temporary rate increase regardless of the rolling average. A single high-consequence error: raise the category by fifty per cent for two weeks. An AI tool update: raise temporarily, because new versions sometimes introduce new behaviours, and CIO Magazine’s reporting on agentic systems makes the point that they drift quietly rather than failing suddenly. Reviewed outputs taking longer than usual: that itself signals a change in quality. Feedback from a client flagging something the sampling missed: raise immediately and stay raised until confidence is restored.
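The four triggers reduce to a small lookup of temporary multiplier and duration. Only the first pairing (plus-fifty-per-cent for two weeks) comes from the text; the other numbers are assumed placeholders, and None stands for "stay raised until confidence is restored":

```python
TRIGGERS = {
    "high_consequence_error": (1.5, 2),    # from the text
    "tool_update":            (1.5, 2),    # assumed
    "review_time_creeping":   (1.25, 2),   # assumed
    "client_flag":            (2.0, None), # assumed multiplier, open-ended
}

def triggered_rate(base_rate, trigger):
    multiplier, weeks = TRIGGERS[trigger]
    return min(1.0, base_rate * multiplier), weeks
```

The table form makes the discipline visible: the decision of how much to raise, and for how long, is made once in advance rather than argued afresh each time something goes wrong.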

Spot-check sampling is one piece of a wider evaluation discipline. The rubric pattern, where outputs are scored blind against a small set of criteria, sits naturally on top, the sample tells you which outputs to review and the rubric tells you what good looks like. The two-person review threshold is a separate decision driven by stakes. The quarterly reflective audit is the long-loop version of the same idea.

The sampling discipline scales down as well as up. A solo founder using AI for client work can run the same logic at smaller numbers, one or two reviews a week, the same five-field log, the same monthly adjustment. The numbers shift, the discipline doesn’t. Over time, as the team gets better with the tools and the tools themselves improve, error rates drift down and the rate can come down with them. The log becomes the mechanism that tells you whether the decline is real or whether you’re just looking less carefully.

If you want to talk through how this would land in your own setup, book a conversation.

Sources

- BBC and European Broadcasting Union (2025). Joint study finding 45% of AI assistant responses contained significant issues when tested on news questions. Used for the baseline error rate that makes review necessary. https://joshbersin.com/2025/10/bbc-finds-that-45-of-ai-queries-produce-erroneous-answers/
- Bhattacharyya et al., PMC (2023). Analysis of ChatGPT-generated medical references finding 47% fabricated, 46% authentic but inaccurate, 7% accurate. Used for the citation-fabrication risk in professional outputs. https://pmc.ncbi.nlm.nih.gov/articles/PMC10277170/
- NIST (2024). AI Risk Management Framework, guidance on classifying applications by consequence severity. Used for the stake-tiering logic behind sampling rates. https://www.nist.gov/artificial-intelligence/ai-research-identifying-managing-harmful-bias-ai
- NIST (2026). New report on expanding the AI evaluation toolbox with statistical models. Used for the principle that statistical validity needs an explicit model and disclosed assumptions. https://www.nist.gov/news-events/news/2026/02/new-report-expanding-ai-evaluation-toolbox-statistical-models
- ASQ (American Society for Quality). Stratification guidance. Used for the rationale behind stratified sampling by producer or tool. https://asq.org/quality-resources/stratification
- Information Commissioner's Office (2023). A guide to AI audits. Used for the recommendation to track error rates over time and let trends inform the controls. https://ico.org.uk/media2/migrated/4022651/a-guide-to-ai-audits.pdf
- Stanford HAI (2025). AI Index Report. Used for the 78% organisational AI usage baseline and the population-level context for SME sampling decisions. https://hai.stanford.edu/ai-index/2025-ai-index-report
- SPC for Excel. Control chart rules and interpretation. Used for the statistical-process-control parallel that grounds rate adjustment on evidence. https://www.spcforexcel.com/knowledge/control-chart-basics/control-chart-rules-interpretation/
- CIO Magazine (2025). On agentic AI systems drifting quietly over time. Used for the warning-sign discipline around tool updates and gradual quality decline. https://www.cio.com/article/4134051/agentic-ai-systems-dont-fail-suddenly-they-drift-over-time.html
- Deloitte. Trustworthy AI governance in practice. Used for the principle that governance is ongoing monitoring and iterative adjustment. https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/articles/trustworthy-ai-governance-in-practice.html

Frequently asked questions

What sample rate should I start with?

For a services business producing around twenty AI-assisted outputs a day, start at five to ten per cent overall, weighted toward higher-stakes categories. Allocate around two hours weekly to the review work itself. Within that budget, push the rate up to fifteen to twenty per cent for high-stakes outputs like client proposals or regulatory documents, and down to two to five per cent for routine internal summaries. The exact number matters less than the logic behind it.

How do I pick which specific outputs to review each week?

Avoid choosing by gut feel; that introduces bias toward outputs you already suspect. Use random selection within each stake category: a simple random number generator and a sort works. If different people or tools are producing outputs, stratify the sample so each is represented in rough proportion. If a specific failure has shown up in the last fortnight, raise the rate temporarily for that category and target it.

When should I change the sample rate?

After roughly four to eight weeks of consistent logging, you have enough data to adjust on evidence. If more than half of sampled outputs in a category contain errors, raise the rate by around fifty per cent for the next cycle. If under one in ten contain errors across a full month, reduce by twenty to thirty per cent. Raise temporarily after any AI tool update, after a single high-consequence error, or after a client flags something the sampling missed.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
