The cash-flow projection in the board pack looked completely reasonable. Twelve months across the top, opening and closing balances down the side, expense categories at the right magnitude for a thirty-person services firm, every line rounded to the nearest thousand. The finance lead spotted the problem on the second read. A £40,000 line item for “professional fees, regulatory” did not match anything on the books, did not match the signed-contract pipeline, and did not survive a single phone call to the partner who handles compliance. The AI tool that generated the projection had built a number that fit the shape of the document, not a number extracted from the firm’s actual data.
That is the failure mode this post is about. The AI did not produce a wildly wrong figure that announced itself. It produced a credible one, with the right magnitude, the right precision, and the right shape for the surrounding context. The owner who can tell a generated number from a verified one has an evaluation discipline. The owner who cannot is making decisions on figures they cannot defend if anyone challenges them.
What does plausible nonsense actually look like?
A number with the right shape and no traceable source. A twelve-month cash-flow forecast where every line is rounded to the nearest thousand, every monthly opening matches the prior closing, every cost category sits at a sensible level, and a particular £40,000 figure cannot be traced to any contract, ledger, or documented assumption. The figure is wrong in the way that survives a first read.
Four use cases produce plausible-nonsense numbers in SMEs more often than the rest. Financial projections, where revenue and cost lines are generated rather than extracted from the books. Market-sizing estimates, where a TAM figure is constructed to fit the statistical profile of similar estimates rather than calculated from the inputs supplied. Headcount benchmarks, where a “best practice” staffing ratio appears without any underlying industry survey behind it. Performance metrics, where a figure like “engagement increased 12 per cent” reads specific enough to imply measurement but cannot be tied to a single analytics platform when challenged. The unifying pattern is shape without ancestry: a number that fits its context perfectly and has no auditable origin.
Why do AI numbers look so right when they are wrong?
Because large language models generate figures the same way they generate words, probabilistically, by selecting outputs that fit the statistical patterns of their training data. When asked for a cash-flow forecast, the model does not retrieve from a database. It produces figures that sit credibly inside the distribution of forecasts it has seen, with the rounding, magnitude, and precision the domain expects. Generated, not calculated.
The published evidence sizes the gap. Stanford’s analysis of leading AI legal-research tools, even with retrieval-augmented generation in place, found hallucination rates on citations and case references above 17 per cent. A Thomson Reuters study of AI reading company financial filings found error rates of around 9 per cent on structured XBRL inputs and 16 to 18 per cent on plain-text filings; the format of the input materially changes whether the AI extracts a real figure or produces a plausible substitute. Many of those errors are not full fabrications but misreadings: the wrong line item picked from a complex table, a “millions” read as “thousands”, a footnote definition missed. A wrong number lifted from a real document passes inspection more easily than an entirely invented one, because it exists in the source and the error sits only in which version got selected.
Which three tests catch plausible nonsense in practice?
Three portable tests catch the bulk of plausible-nonsense numbers without a finance team or a verification tool. The precision-mismatch test asks whether the figure’s precision exceeds the precision of its inputs. The source-traceability test asks whether each figure can be annotated with a specific origin. The magnitude-reference test cross-checks the figure against two or three independent benchmarks. Each takes thirty seconds.
The precision-mismatch test is the fastest. Headcount planning is almost never resolved to 0.1 FTE, so an AI forecast of 7.3 operations staff is signalling false precision: the underlying data on current headcount, attrition, and hiring is known only to the nearest whole person. A market-sizing estimate returning £47.3 million from rough population data and an adoption rate is showing decimal precision the inputs cannot support. The fix is to ask where the decimal came from. If the answer is vague, round the figure to match the precision actually present in the inputs.
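For owners who like to see the test mechanically, here is a minimal Python sketch of the precision-mismatch check. The function names and the “coarsest input step” convention are illustrative assumptions, not a standard tool; the point is the comparison, not the code.

```python
def implied_step(value: float) -> float:
    """Smallest unit the figure claims to resolve: 7.3 -> 0.1, 420_000 -> 10_000."""
    text = f"{value:.10f}".rstrip("0").rstrip(".")
    if "." in text:
        decimals = len(text.split(".")[1])
        return 10 ** -decimals
    # Trailing zeros in an integer suggest rounding to that unit.
    stripped = text.rstrip("0")
    return float(10 ** (len(text) - len(stripped))) if stripped else 1.0

def precision_mismatch(figure: float, coarsest_input_step: float) -> bool:
    """Flag a figure whose implied precision is finer than its inputs support."""
    return implied_step(figure) < coarsest_input_step

# Headcount is known only to whole people, so a 7.3 FTE forecast is flagged.
print(precision_mismatch(7.3, coarsest_input_step=1.0))               # True
# A £47.3m market-sizing figure built from inputs rounded to the nearest million.
print(precision_mismatch(47_300_000, coarsest_input_step=1_000_000))  # True
```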
The source-traceability test is the most useful and the hardest to skip if applied honestly. For every significant figure in an AI output, the owner should be able to point to a source category: payroll system, signed contracts, vendor price list, internal best estimate of renewal rate. A revenue line of £180,000 in Month 3 should be traceable to “two signed contracts at £90,000 each”. A figure without a source category is generated rather than extracted, and is a candidate for verification or replacement. The magnitude-reference test sits alongside it, cross-checking the figure against independent benchmarks. An AI estimate that a ten-person software firm should spend £420,000 a year on operations staff is credible only if comparable-firm surveys put the band in that range. If the surveys say £200,000 to £350,000, the AI figure is an outlier worth examining.
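The other two tests reduce to equally small checks. The sketch below is illustrative only: the dataclass fields, source strings, and benchmark bounds are invented for the example, not drawn from any real system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Figure:
    label: str
    value: float
    source: Optional[str] = None  # e.g. "payroll system", "signed contracts"; None if generated

def untraceable(figures: list) -> list:
    """Figures with no source category are generated, not extracted: verify or replace."""
    return [f for f in figures if not f.source]

def magnitude_outlier(value: float, low: float, high: float) -> bool:
    """Flag a figure that falls outside independently sourced benchmark bounds."""
    return not (low <= value <= high)

lines = [
    Figure("Month 3 revenue", 180_000, source="two signed contracts at £90,000 each"),
    Figure("Professional fees, regulatory", 40_000),  # no traceable source
]
for f in untraceable(lines):
    print(f"Needs verification: {f.label} £{f.value:,.0f}")

# Comparable-firm surveys put ops spend for a ten-person software firm at £200k-£350k.
print(magnitude_outlier(420_000, 200_000, 350_000))  # True: an outlier worth examining
```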
When should owners spot-check, sample, or recompute from source?
Verification intensity should match the stakes, because no small firm has the capacity to recompute everything. Spot-check a single figure when it is high impact and sits inside your domain expertise. Sample-check three to five figures when the AI output is a multi-line document. Recompute from source data when the figure will drive a strategic decision or appear in a funding application.
Spot-checking works well when you know the territory. A founder who has run the business for a decade can usually tell whether a projected 15 per cent Q4 margin aligns with the cost structure and seasonality they have lived with. Sample-checking is the move for documents with many related figures: verifying three to five randomly selected lines against payroll, supplier invoices, or the bank statement is usually enough to surface systematic issues if any are present. Recomputing is reserved for the high-stakes work: anything you would be challenged on by a lender, an investor, a regulator, or a buyer in due diligence. A simple spreadsheet built from documented inputs is almost always more defensible than the AI output, and the time cost is repaid the moment the figure is scrutinised.
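A sample-check can be as mundane as pulling random rows from the exported forecast and checking each one by hand. A minimal sketch follows; the file name and column names are assumptions for illustration.

```python
import csv
import random

def sample_for_verification(path: str, k: int = 5) -> list:
    """Pick k random rows from an exported forecast for manual checking."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return random.sample(rows, min(k, len(rows)))

# Verify each sampled line against payroll, supplier invoices, or the bank statement.
for row in sample_for_verification("cashflow_forecast.csv", k=5):
    print(row["line_item"], row["month"], row["amount"])
```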
How should the team know the difference between generated and verified numbers?
By labelling. Every number in a planning document should carry a tag indicating its origin: “AI-generated from inputs as of [date]”, “Extracted from [system]”, “Calculated from [specified inputs]”, or “Verified by [person/date]”. The label travels with the figure through decks and prevents the silent merging of generated forecasts with extracted-from-contracts baselines. A board pack that conflates the two has lost its data lineage.
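If the planning numbers live in a spreadsheet export or a small script rather than only in a deck, the labelling rule can be made structural. The sketch below mirrors the four tags from the text; the class and field names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    AI_GENERATED = "AI-generated from inputs as of [date]"
    EXTRACTED = "Extracted from [system]"
    CALCULATED = "Calculated from [specified inputs]"
    VERIFIED = "Verified by [person/date]"

@dataclass
class TaggedFigure:
    label: str
    value: float
    provenance: Provenance
    detail: str  # the filled-in date, system, inputs, or person

fig = TaggedFigure("Month 3 revenue", 180_000, Provenance.EXTRACTED, "signed-contract register")
print(f"{fig.label}: £{fig.value:,.0f}  [{fig.provenance.name}: {fig.detail}]")
```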
The rule on which kind belongs where is straightforward. AI-generated figures are fine for brainstorming, scenario modelling, and identifying questions worth investigating further. They are out of place in funding applications, regulatory filings, board presentations to external parties, or any communication where inaccuracy carries material risk. The same figure can move through stages: an AI estimate becomes the starting point for a conversation, the conversation produces a documented calculation, and the documented calculation becomes the figure that reaches the stakeholder. The discipline is to treat AI output as a draft pending verification by default, not as a final number with the verification step quietly skipped. The Kyriba 2025 CFO survey found 76 per cent of finance leaders already concerned about AI accuracy and 61 per cent admitting to second-guessing their data monthly even without AI in the loop; the labelling rule formalises a judgement they are already making informally.
If you want a sounding board on where AI-generated numbers are already travelling through your own decision documents, book a conversation.



