Sampling rates for AI output: let volume and stakes set the number

TL;DR

An SME AI review rate should be a named number, not a feeling. Set the rate per output category, anchored to weekly volume and the stakes of getting it wrong. Record it where the team can see it, name one owner, and revisit it every quarter against the errors you actually found. Owners who do this run quality control at maintenance cost. Owners who don't run on luck.

Key takeaways

- A constant review rate fails predictably: it starts at 100% and drifts down to whatever feels manageable, without anyone deciding.
- Two variables should set the rate per category: weekly output volume and the stakes of an error in that category.
- A typical SME starting matrix puts high-stakes work at 50 to 100%, medium-stakes at 20 to 50%, low-stakes at 5 to 10%.
- The rate is only real if it lives somewhere documented, has a named owner, and is reviewed every quarter against logged findings.
- Quarterly review is when you raise, lower, or temporarily boost rates on evidence, not on instinct.

An owner I was talking to last month had grown her team’s AI use from a few drafts a week to something close to forty outputs a day across proposals, client emails, internal summaries and financial workings. She was reviewing on instinct. When I asked her what percentage of the output actually got checked, she paused and said she had no idea. That’s the moment many SMEs end up in. The review rate stops being a number and becomes a feeling.

The fix isn’t a quality-management programme. It’s a sample rate that’s named, recorded, tied to volume and risk per category, and revisited every quarter. That single discipline is the difference between an AI quality control approach you can defend and one that’s just luck scaled up.

What is a sampling rate for AI output, and why does it need to be a number?

A sampling rate is the percentage of AI-generated outputs in a given category that get reviewed by a human before they go anywhere. It has to be a written number because anything else drifts. A team that starts at 100% review will settle, within months, at whatever feels manageable on a given week, without anyone deciding. A named rate forces a choice.

The choice itself isn’t the hard part. The hard part is that the absence of a number lets the rate move silently. Six months in, the team is checking around 10% of outputs and would tell you they’re checking everything important. Both can be true if “important” is defined on the day. Naming the rate makes the conversation real, and makes any future adjustment a decision rather than a drift.

Why does a constant review rate fail as volume grows?

A flat rate is either too heavy at low volume or too light at high volume, and at SME scale a firm usually moves through both within a year. Five outputs a week at 100% review is sustainable. Thirty outputs a week at 100% isn't, and the team will silently drop the rate without telling anyone. The PCAOB audit standards make the same point for financial sampling: an undocumented rate is an unquantified risk.

The numbers behind the risk are not small. The BBC and European Broadcasting Union found 45% of AI assistant responses to news questions contained significant issues. At that error rate, an undocumented review rate of around 10% is letting through roughly 40 erroneous outputs per 100 generated. The volume keeps growing. The instinct can't keep up. The number has to.
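The arithmetic behind that figure is worth making explicit. A minimal sketch, assuming errors are spread evenly across outputs and a review catches every error it inspects (both optimistic assumptions):

```python
# Expected errors slipping through per batch of outputs, under a uniform
# error rate and a review that catches every error it inspects.
def slipped_errors(outputs: int, error_rate: float, review_rate: float) -> float:
    errors = outputs * error_rate   # errors generated in the batch
    caught = errors * review_rate   # errors that land in the reviewed sample
    return errors - caught          # errors that ship unreviewed

# BBC/EBU's 45% issue rate, reviewed at 10%:
print(slipped_errors(100, 0.45, 0.10))   # 40.5 errors per 100 outputs
```

The review rate only removes errors in proportion to the sample, so the unreviewed 90% carries the error rate untouched.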

What two variables should drive the rate at SME scale?

Volume per category and stakes per category. Those are the only two inputs that matter at this scale. Volume sets how much sampling is possible inside a realistic review budget. Stakes set how much of that budget each category deserves. The NIST AI Risk Management Framework treats this as the standard discipline: evaluate accuracy against representative test sets and disaggregate the results by category.

For an SME that means a separate rate per use case, not one number averaged across everything. Client proposals carry different stakes from internal meeting summaries, and financial work for clients carries different stakes again. The OECD's research on SME AI adoption makes the constraint side of this explicit: small firms face skill and resource limits that enterprise governance frameworks ignore. A category-by-category rate accommodates that. You don't need to review everything. You need to know what fraction of each category you're reviewing, and why.

What does the starting matrix usually look like?

For a typical services SME, the matrix lands in three bands. High-stakes work (client-facing documents, financial analysis, anything regulatory) starts at 50 to 100% depending on weekly volume. Medium-stakes work (internal process documents, employee-facing communications) starts at 20 to 50%. Low-stakes work (meeting summaries, brainstorming notes, first-draft outlines) starts at 5 to 10%.

The Vectara hallucination leaderboard shows model error rates varying from around 3% to over 10% across summary tasks, so the per-category rate is calibrating to where the risk actually lives, not to a firm-wide average. A 10-person services firm might produce 100 AI outputs a week. The matrix could be 50% on 20 weekly proposals, 100% on 5 financial documents for clients, 40% on 8 employee comms, 20% on 15 process memos, 10% on 30 meeting summaries, 5% on 25 ideation drafts. Blended that’s about 25% across the firm, which sounds low until you see the high-stakes categories sitting at 50 to 100%.
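The blended figure above can be checked directly. A sketch of the worked example, where the category names, volumes, and rates are the illustrative numbers from the text, not targets for any particular firm:

```python
# Per-category weekly volumes and review rates for the hypothetical
# 10-person services firm; both columns are example numbers, not targets.
matrix = {
    "proposals":         (20, 0.50),
    "client financials":  (5, 1.00),
    "employee comms":     (8, 0.40),
    "process memos":     (15, 0.20),
    "meeting summaries": (30, 0.10),
    "ideation drafts":   (25, 0.05),
}

reviews = sum(vol * rate for vol, rate in matrix.values())  # 25.45 reviews
outputs = sum(vol for vol, _ in matrix.values())            # 103 outputs

print(f"{reviews:.1f} reviews across {outputs} outputs "
      f"= {100 * reviews / outputs:.0f}% blended")
```

Running this gives a blended rate of about 25%, even though the two high-stakes categories sit at 50% and 100%, which is the point: the average hides where the review effort actually goes.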

The blended figure is an output of the discipline, not the input. If you set the blend first and then back into the per-category rates, you'll usually end up over-reviewing low-stakes work and under-reviewing the work that matters. Set the per-category rates against volume and stakes, then let the blended number fall where it falls. The ASQ Acceptable Quality Level framework gives the underlying logic: higher stakes require lower tolerated error rates, which usually means a higher sampling rate. The maths follows from the risk position, not the other way around.

Where should the rate live, who owns it, and how often is it reviewed?

The rate lives in one of three places: a quality policy document, an operations procedure, or a simple spreadsheet linked to your workflow tools. It matters that it exists in writing, not which of the three. One named person owns it. In a 10-person firm that's usually an operations manager or senior team member. Their job is to make sure samples are taken, findings get logged, and the rate is revisited each quarter.

McKinsey’s 2025 AI survey found firms with documented validation processes were significantly more likely to capture material value from AI. The ICO’s UK guidance on AI and data protection asks for the same thing, document how systems are built, tested, deployed and monitored. Without a named owner, the rate drifts within a quarter. With one, the discipline holds because someone has it on their plate.

The quarterly review is the moment the rate changes on evidence. Half an hour with the log. What did we actually sample, what error rates did we find, did any category produce surprises, does anything need a temporary boost, are we logging properly. If financial documents were sampled at 100% and three errors appeared across 20 samples, that’s information. The team can hold at 100%, switch tools, add a checklist, or raise the rate elsewhere. If a category ran clean for two consecutive quarters, the rate can come down. If a new use case appeared, it gets added to the matrix. The decision is made on the data, not on mood.
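The decision rule in that paragraph can be sketched as a small function. The thresholds here (boost by half on any error, relax by a quarter only after two clean quarters, never drop below 5%) are illustrative assumptions, not numbers from any standard:

```python
def adjust_rate(current_rate: float, clean_quarters: int,
                errors_found: int, samples_taken: int) -> float:
    """Illustrative quarterly adjustment: boost on evidence of errors,
    relax only after two consecutive clean quarters, floor at 5%."""
    if samples_taken and errors_found > 0:
        return min(1.0, current_rate * 1.5)    # temporary boost, capped at 100%
    if clean_quarters >= 2:
        return max(0.05, current_rate * 0.75)  # earn the reduction slowly
    return current_rate                        # hold until the data says move

# Financial docs: 3 errors in 20 samples at 100% review -> hold at 100%.
print(adjust_rate(1.0, 0, 3, 20))    # 1.0
# Meeting summaries: two clean quarters at 10% -> ease down to ~7.5%.
print(adjust_rate(0.10, 2, 0, 30))   # ~0.075
```

In practice this lives in the owner's head or a spreadsheet column, not in code; the value of writing it down is that the same inputs always produce the same decision.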

The discipline scales down as well as up. A solo founder using AI for client work can run the same logic at smaller numbers, one or two reviews a week, a five-field log, a quarterly check. The numbers shift. The principle doesn’t. If you want help setting the matrix for your own AI use, book a conversation.

Sources

- NIST (2023). AI Risk Management Framework 1.0. Used for the principle that accuracy should be evaluated against representative test sets with results disaggregated by data segment. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
- PCAOB (2010). Auditing Standard AS 2315, Audit Sampling. Used for the principle that undocumented drift in sampling rates produces unquantified risk. https://pcaobus.org/oversight/standards/auditing-standards/details/AS2315
- McKinsey (2025). The state of AI. Used for the finding that AI high performers define clear processes for human validation and achieve materially more enterprise value. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- BBC and European Broadcasting Union, via Bersin (2025). 45% of AI assistant responses to news questions contained significant issues. Used for the baseline error rate that makes a documented review rate necessary. https://joshbersin.com/2025/10/bbc-finds-that-45-of-ai-queries-produce-erroneous-answers/
- OECD (2025). AI adoption by small and medium-sized enterprises. Used for the constraints (skill and resource) that a proportionate sampling framework explicitly accommodates. https://www.oecd.org/content/dam/oecd/en/publications/reports/2025/12/ai-adoption-by-small-and-medium-sized-enterprises_9c48eae6/426399c1-en.pdf
- Information Commissioner's Office. Guidance on AI and data protection. Used for the principle that organisations should document how AI systems are built, tested, deployed, and monitored. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/
- Vectara (2025). Hallucination leaderboard for summary tasks. Used for the per-category error-rate variance that makes one-size-fits-all sampling unworkable. https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
- ASQ. Acceptable Quality Level (AQL) reference. Used for the principle that higher stakes require lower acceptable defect rates and therefore higher sampling rates. https://www.6sigma.us/six-sigma-in-focus/acceptable-quality-level-aql/
- University of Massachusetts Lowell (LaMorte). Sample size calculation reference. Used for the standard 95% confidence calculation that underpins any defensible sampling decision. https://www.uml.edu/docs/sample%20size%20calcs%20LaMorte_tcm18-37807.doc

Frequently asked questions

Is there a single review rate that works for an SME?

No, and that's the point. A blended rate of around 25% across a services firm is common, but the figure is meaningless without the breakdown underneath it. The same firm might be reviewing 100% of financial work for clients, 40% of employee-facing comms, and 10% of internal meeting summaries. The blended number is the output, not the input. Set the rate per category, then let the average fall where it falls.

How do I know my sample is statistically meaningful?

For an SME the honest answer is usually that you don't need formal statistical power, you need a documented rate applied consistently. Standard sampling theory says a 95% confidence level with a 5% margin needs around 384 items, which is well above what a small firm generates in a quarter. The practical test is whether a quarterly review of your sample shows error patterns clearly enough to act on. If it does, the rate is sufficient. If it doesn't, raise it.
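The 384 figure comes from the standard sample-size formula for a proportion, n = z²·p(1−p)/e², with z = 1.96 for 95% confidence, p = 0.5 (the worst case), and e = 0.05. A sketch of that calculation, purely to show where the number comes from, not something an SME needs to run:

```python
import math

def sample_size(z: float = 1.96, p: float = 0.5, margin: float = 0.05) -> int:
    """Classic sample size for estimating a proportion:
    n = z^2 * p * (1 - p) / e^2, rounded up to a whole item."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size())             # 385 items for 95% confidence, +/-5% margin
print(sample_size(margin=0.10))  # 97 items if +/-10% is acceptable
```

Widening the acceptable margin collapses the requirement quickly, which is why the practical test in the paragraph above (can you see patterns clearly enough to act?) is the right bar at SME volumes.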

Who should own the sampling rate in a ten-person firm?

One named person, usually an operations manager or senior team member. Their job isn't to do every review themselves, it's to make sure samples are actually selected (randomly, not by what's lying around), findings get logged, and the rate gets revisited each quarter against the data. Without a named owner the rate drifts in three months. With one, the discipline holds.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
