Sampling rates for AI output: let volume and stakes set the number

TL;DR

An SME AI review rate should be a named number, not a feeling. Set the rate per output category, anchored to weekly volume and the stakes of getting it wrong. Record it where the team can see it, name one owner, and revisit it every quarter against the errors you actually found. Owners who do this run quality control at maintenance cost. Owners who don't run on luck.

Key takeaways

- A constant review rate fails predictably: it starts at 100% and drifts down to whatever feels manageable, without anyone deciding.
- Two variables should set the rate per category: weekly output volume and the stakes of an error in that category.
- A typical SME starting matrix puts high-stakes work at 50 to 100%, medium-stakes at 20 to 50%, low-stakes at 5 to 10%.
- The rate is only real if it lives somewhere documented, has a named owner, and is reviewed every quarter against logged findings.
- Quarterly review is when you raise, lower, or temporarily boost rates on evidence, not on instinct.

An owner I was talking to last month had grown her team’s AI use from a few drafts a week to something close to forty outputs a day across proposals, client emails, internal summaries and financial workings. She was reviewing on instinct. When I asked her what percentage of the output actually got checked, she paused and said she had no idea. That’s the moment many SMEs end up in. The review rate stops being a number and becomes a feeling.

The fix isn’t a quality-management programme. It’s a sample rate that’s named, recorded, tied to volume and risk per category, and revisited every quarter. That single discipline is the difference between an AI quality control approach you can defend and one that’s just luck scaled up.

What is a sampling rate for AI output, and why does it need to be a number?

A sampling rate is the percentage of AI-generated outputs in a given category that get reviewed by a human before they go anywhere. It has to be a written number because anything else drifts. A team that starts at 100% review will settle, within months, at whatever feels manageable on a given week, without anyone deciding. A named rate forces a choice.

The choice itself isn’t the hard part. The hard part is that the absence of a number lets the rate move silently. Six months in, the team is checking around 10% of outputs and would tell you they’re checking everything important. Both can be true if “important” is defined on the day. Naming the rate makes the conversation real, and makes any future adjustment a decision rather than a drift.

Why does a constant review rate fail as volume grows?

A flat rate is either too heavy at low volume or too light at high volume, and at SME scale a firm usually moves through both within a year. Five outputs a week at 100% review is sustainable. Thirty outputs a week at 100% isn't, and the team will silently drop the rate without telling anyone. The PCAOB audit standards make the same point for financial sampling: an undocumented rate is an unquantified risk.

The numbers behind the risk are not small. The BBC and European Broadcasting Union found 45% of AI assistant responses to news questions contained significant issues. At that error rate, an undocumented review rate of around 10% is letting through roughly 40 erroneous outputs per 100 generated. The volume keeps growing. The instinct can't keep up. The number has to.
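The arithmetic behind that figure is worth making explicit. A minimal sketch, assuming errors are spread evenly across outputs and a review catches every error it inspects (both optimistic assumptions):

```python
# Expected errors slipping through per batch of outputs, under a uniform
# error rate and a review that catches every error it inspects.
def slipped_errors(outputs: int, error_rate: float, review_rate: float) -> float:
    errors = outputs * error_rate   # errors generated in the batch
    caught = errors * review_rate   # errors that land in the reviewed sample
    return errors - caught          # errors that ship unreviewed

# BBC/EBU's 45% issue rate, reviewed at 10%:
print(slipped_errors(100, 0.45, 0.10))   # 40.5 errors per 100 outputs
```

The review rate only removes errors in proportion to the sample, so the unreviewed 90% carries the error rate untouched.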

What two variables should drive the rate at SME scale?

Volume per category and stakes per category. Those are the only two inputs that matter at this scale. Volume sets how much sampling is possible inside a realistic review budget. Stakes set how much of that budget each category deserves. The NIST AI Risk Management Framework treats this as the standard discipline: evaluate accuracy against representative test sets and disaggregate the results by category.

For an SME that means a separate rate per use case, not one number averaged across everything. Client proposals carry different stakes from internal meeting summaries, and financial work for clients carries different stakes again. The OECD's research on SME AI adoption makes the constraint side of this explicit: small firms face skill and resource limits that enterprise governance frameworks ignore. A category-by-category rate accommodates that. You don't need to review everything. You need to know what fraction of each category you're reviewing, and why.

What does the starting matrix usually look like?

For a typical services SME, the matrix lands in three bands. High-stakes work (client-facing documents, financial analysis, anything regulatory) starts at 50 to 100% depending on weekly volume. Medium-stakes work (internal process documents, employee-facing communications) starts at 20 to 50%. Low-stakes work (meeting summaries, brainstorming notes, first-draft outlines) starts at 5 to 10%.

The Vectara hallucination leaderboard shows model error rates varying from around 3% to over 10% across summary tasks, so the per-category rate is calibrating to where the risk actually lives, not to a firm-wide average. A 10-person services firm might produce 100 AI outputs a week. The matrix could be 50% on 20 weekly proposals, 100% on 5 financial documents for clients, 40% on 8 employee comms, 20% on 15 process memos, 10% on 30 meeting summaries, 5% on 25 ideation drafts. Blended that’s about 25% across the firm, which sounds low until you see the high-stakes categories sitting at 50 to 100%.
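The blended figure above can be checked directly. A sketch of the worked example, where the category names, volumes, and rates are the illustrative numbers from the text, not targets for any particular firm:

```python
# Per-category weekly volumes and review rates for the hypothetical
# 10-person services firm; both columns are example numbers, not targets.
matrix = {
    "proposals":         (20, 0.50),
    "client financials":  (5, 1.00),
    "employee comms":     (8, 0.40),
    "process memos":     (15, 0.20),
    "meeting summaries": (30, 0.10),
    "ideation drafts":   (25, 0.05),
}

reviews = sum(vol * rate for vol, rate in matrix.values())  # 25.45 reviews
outputs = sum(vol for vol, _ in matrix.values())            # 103 outputs

print(f"{reviews:.1f} reviews across {outputs} outputs "
      f"= {100 * reviews / outputs:.0f}% blended")
```

Running this gives a blended rate of about 25%, even though the two high-stakes categories sit at 50% and 100%, which is the point: the average hides where the review effort actually goes.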

The blended figure is an output of the discipline, not the input. If you set the blend first and then back into the per-category rates, you'll usually end up over-reviewing low-stakes work and under-reviewing the work that matters. Set the per-category rates against volume and stakes, then let the blended number fall where it falls. The ASQ Acceptable Quality Level framework gives the underlying logic: higher stakes require lower tolerated error rates, which usually means a higher sampling rate. The maths follows from the risk position, not the other way around.

Where should the rate live, who owns it, and how often is it reviewed?

The rate lives in one of three places: a quality policy document, an operations procedure, or a simple spreadsheet linked to your workflow tools. It matters that it exists in writing, not which of the three. One named person owns it. In a 10-person firm that's usually an operations manager or senior team member. Their job is to make sure samples are taken, findings get logged, and the rate is revisited each quarter.

McKinsey’s 2025 AI survey found firms with documented validation processes were significantly more likely to capture material value from AI. The ICO’s UK guidance on AI and data protection asks for the same thing, document how systems are built, tested, deployed and monitored. Without a named owner, the rate drifts within a quarter. With one, the discipline holds because someone has it on their plate.

The quarterly review is the moment the rate changes on evidence. Half an hour with the log. What did we actually sample, what error rates did we find, did any category produce surprises, does anything need a temporary boost, are we logging properly. If financial documents were sampled at 100% and three errors appeared across 20 samples, that’s information. The team can hold at 100%, switch tools, add a checklist, or raise the rate elsewhere. If a category ran clean for two consecutive quarters, the rate can come down. If a new use case appeared, it gets added to the matrix. The decision is made on the data, not on mood.
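The decision rule in that paragraph can be sketched as a small function. The thresholds here (boost by half on any error, relax by a quarter only after two clean quarters, never drop below 5%) are illustrative assumptions, not numbers from any standard:

```python
def adjust_rate(current_rate: float, clean_quarters: int,
                errors_found: int, samples_taken: int) -> float:
    """Illustrative quarterly adjustment: boost on evidence of errors,
    relax only after two consecutive clean quarters, floor at 5%."""
    if samples_taken and errors_found > 0:
        return min(1.0, current_rate * 1.5)    # temporary boost, capped at 100%
    if clean_quarters >= 2:
        return max(0.05, current_rate * 0.75)  # earn the reduction slowly
    return current_rate                        # hold until the data says move

# Financial docs: 3 errors in 20 samples at 100% review -> hold at 100%.
print(adjust_rate(1.0, 0, 3, 20))    # 1.0
# Meeting summaries: two clean quarters at 10% -> ease down to ~7.5%.
print(adjust_rate(0.10, 2, 0, 30))   # ~0.075
```

In practice this lives in the owner's head or a spreadsheet column, not in code; the value of writing it down is that the same inputs always produce the same decision.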

The discipline scales down as well as up. A solo founder using AI for client work can run the same logic at smaller numbers, one or two reviews a week, a five-field log, a quarterly check. The numbers shift. The principle doesn’t. If you want help setting the matrix for your own AI use, book a conversation.

Sources

- NIST (2023). AI Risk Management Framework 1.0. Used for the principle that accuracy should be evaluated against representative test sets with results disaggregated by data segment. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
- PCAOB (2010). Auditing Standard AS 2315, Audit Sampling. Used for the principle that undocumented drift in sampling rates produces unquantified risk. https://pcaobus.org/oversight/standards/auditing-standards/details/AS2315
- McKinsey (2025). The state of AI. Used for the finding that AI high performers define clear processes for human validation and achieve materially more enterprise value. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- BBC and European Broadcasting Union, via Bersin (2025). 45% of AI assistant responses to news questions contained significant issues. Used for the baseline error rate that makes a documented review rate necessary. https://joshbersin.com/2025/10/bbc-finds-that-45-of-ai-queries-produce-erroneous-answers/
- OECD (2025). AI adoption by small and medium-sized enterprises. Used for the constraints (skill and resource) that a proportionate sampling framework explicitly accommodates. https://www.oecd.org/content/dam/oecd/en/publications/reports/2025/12/ai-adoption-by-small-and-medium-sized-enterprises_9c48eae6/426399c1-en.pdf
- Information Commissioner's Office. Guidance on AI and data protection. Used for the principle that organisations should document how AI systems are built, tested, deployed, and monitored. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/
- Vectara (2025). Hallucination leaderboard for summary tasks. Used for the per-category error-rate variance that makes one-size-fits-all sampling unworkable. https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
- ASQ. Acceptable Quality Level (AQL) reference. Used for the principle that higher stakes require lower acceptable defect rates and therefore higher sampling rates. https://www.6sigma.us/six-sigma-in-focus/acceptable-quality-level-aql/
- University of Massachusetts Lowell (LaMorte). Sample size calculation reference. Used for the standard 95% confidence calculation that underpins any defensible sampling decision. https://www.uml.edu/docs/sample%20size%20calcs%20LaMorte_tcm18-37807.doc

Frequently asked questions

Is there a single review rate that works for an SME?

No, and that's the point. A blended rate of around 25% across a services firm is common, but the figure is meaningless without the breakdown underneath it. The same firm might be reviewing 100% of financial work for clients, 40% of employee-facing comms, and 10% of internal meeting summaries. The blended number is the output, not the input. Set the rate per category, then let the average fall where it falls.

How do I know my sample is statistically meaningful?

For an SME the honest answer is usually that you don't need formal statistical power, you need a documented rate applied consistently. Standard sampling theory says a 95% confidence level with a 5% margin needs around 384 items, which is well above what a small firm generates in a quarter. The practical test is whether a quarterly review of your sample shows error patterns clearly enough to act on. If it does, the rate is sufficient. If it doesn't, raise it.
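The 384 figure comes from the standard sample-size formula for a proportion, n = z²·p(1−p)/e², with z = 1.96 for 95% confidence, p = 0.5 (the worst case), and e = 0.05. A sketch of that calculation, purely to show where the number comes from, not something an SME needs to run:

```python
import math

def sample_size(z: float = 1.96, p: float = 0.5, margin: float = 0.05) -> int:
    """Classic sample size for estimating a proportion:
    n = z^2 * p * (1 - p) / e^2, rounded up to a whole item."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size())             # 385 items for 95% confidence, +/-5% margin
print(sample_size(margin=0.10))  # 97 items if +/-10% is acceptable
```

Widening the acceptable margin collapses the requirement quickly, which is why the practical test in the paragraph above (can you see patterns clearly enough to act?) is the right bar at SME volumes.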

Who should own the sampling rate in a ten-person firm?

One named person, usually an operations manager or senior team member. Their job isn't to do every review themselves, it's to make sure samples are actually selected (randomly, not by what's lying around), findings get logged, and the rate gets revisited each quarter against the data. Without a named owner the rate drifts in three months. With one, the discipline holds.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
