The cash-flow projection in the board pack looked completely reasonable. Twelve months across the top, opening and closing balances down the side, expense categories at the right magnitude for a thirty-person services firm, every line rounded to the nearest thousand. The finance lead spotted the problem on the second read. A £40,000 line item for “professional fees, regulatory” did not match anything on the books, did not match the signed-contract pipeline, and did not survive a single phone call to the partner who handles compliance. The AI tool that generated the projection had built a number that fit the shape of the document, not a number extracted from the firm’s actual data.
That is the failure mode this post is about. The AI did not produce a wildly wrong figure that announced itself. It produced a credible one, with the right magnitude, the right precision, and the right shape for the surrounding context. The owner who can tell a generated number from a verified one has an evaluation discipline. The owner who cannot is making decisions on figures they cannot defend if anyone challenges them.
What does plausible nonsense actually look like?
A number with the right shape and no traceable source. A twelve-month cash-flow forecast where every line is rounded to the nearest thousand, every monthly opening matches the prior closing, every cost category sits at a sensible level, and a particular £40,000 figure cannot be traced to any contract, ledger, or documented assumption. The figure is wrong in the way that survives a first read.
Four use cases produce plausible-nonsense numbers in SMEs more often than the rest. Financial projections, where revenue and cost lines are generated rather than extracted from the books. Market-sizing estimates, where a TAM figure is constructed to fit the statistical profile of similar estimates rather than calculated from the inputs supplied. Headcount benchmarks, where a “best practice” staffing ratio appears without any underlying industry survey behind it. Performance metrics, where a figure like “engagement increased 12 per cent” reads specific enough to imply measurement but cannot be tied to a single analytics platform when challenged. The unifying pattern is shape without ancestry: a number that fits its context perfectly and has no auditable origin.
Why do AI numbers look so right when they are wrong?
Because large language models generate figures the same way they generate words, probabilistically, by selecting outputs that fit the statistical patterns of their training data. When asked for a cash-flow forecast, the model does not retrieve from a database. It produces figures that sit credibly inside the distribution of forecasts it has seen, with the rounding, magnitude, and precision the domain expects. Generated, not calculated.
The published evidence sizes the gap. Stanford’s analysis of leading AI legal-research tools, even with retrieval-augmented generation in place, found hallucination rates on citations and case references above 17 per cent. A Thomson Reuters study of AI reading company financial filings found error rates of around 9 per cent on structured XBRL inputs and 16 to 18 per cent on plain-text filings; the format of the input materially changes whether the AI extracts a real figure or produces a plausible substitute. Many of those errors are not full fabrications but misreadings: the wrong line item picked from a complex table, a “millions” read as “thousands”, a footnote definition missed. A wrong number lifted from a real document passes inspection more easily than an entirely invented one, because it exists in the source and the error sits only in which version got selected.
Which three tests catch plausible nonsense in practice?
Three portable tests catch the bulk of plausible-nonsense numbers without a finance team or a verification tool. The precision-mismatch test asks whether the figure’s precision exceeds the precision of its inputs. The source-traceability test asks whether each figure can be annotated with a specific origin. The magnitude-reference test cross-checks the figure against two or three independent benchmarks. Each takes thirty seconds.
The precision-mismatch test is the fastest. Headcount planning is almost never resolved to 0.1 FTE, so an AI forecast of 7.3 operations staff is signalling false precision: the underlying data on current headcount, attrition, and hiring is known only to the nearest whole person. A market-sizing estimate returning £47.3 million from rough population data and an adoption rate is showing decimal precision the inputs cannot support. The fix is to ask where the decimal came from. If the answer is vague, round the figure to match the precision actually present in the inputs.
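For owners who like to see the test mechanically, here is a minimal Python sketch of the precision-mismatch check. The function names and the “coarsest input step” convention are illustrative assumptions, not a standard tool; the point is the comparison, not the code.

```python
def implied_step(value: float) -> float:
    """Smallest unit the figure claims to resolve: 7.3 -> 0.1, 420_000 -> 10_000."""
    text = f"{value:.10f}".rstrip("0").rstrip(".")
    if "." in text:
        decimals = len(text.split(".")[1])
        return 10 ** -decimals
    # Trailing zeros in an integer suggest rounding to that unit.
    stripped = text.rstrip("0")
    return float(10 ** (len(text) - len(stripped))) if stripped else 1.0

def precision_mismatch(figure: float, coarsest_input_step: float) -> bool:
    """Flag a figure whose implied precision is finer than its inputs support."""
    return implied_step(figure) < coarsest_input_step

# Headcount is known only to whole people, so a 7.3 FTE forecast is flagged.
print(precision_mismatch(7.3, coarsest_input_step=1.0))               # True
# A £47.3m market-sizing figure built from inputs rounded to the nearest million.
print(precision_mismatch(47_300_000, coarsest_input_step=1_000_000))  # True
```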
The source-traceability test is the most useful and the hardest to skip if applied honestly. For every significant figure in an AI output, the owner should be able to point to a source category: payroll system, signed contracts, vendor price list, internal best estimate of renewal rate. A revenue line of £180,000 in Month 3 should be traceable to “two signed contracts at £90,000 each”. A figure without a source category is generated rather than extracted, and is a candidate for verification or replacement. The magnitude-reference test sits alongside it, cross-checking the figure against independent benchmarks. An AI estimate that a ten-person software firm should spend £420,000 a year on operations staff is credible only if comparable-firm surveys put the band in that range. If the surveys say £200,000 to £350,000, the AI figure is an outlier worth examining.
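The other two tests reduce to equally small checks. The sketch below is illustrative only: the dataclass fields, source strings, and benchmark bounds are invented for the example, not drawn from any real system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Figure:
    label: str
    value: float
    source: Optional[str] = None  # e.g. "payroll system", "signed contracts"; None if generated

def untraceable(figures: list) -> list:
    """Figures with no source category are generated, not extracted: verify or replace."""
    return [f for f in figures if not f.source]

def magnitude_outlier(value: float, low: float, high: float) -> bool:
    """Flag a figure that falls outside independently sourced benchmark bounds."""
    return not (low <= value <= high)

lines = [
    Figure("Month 3 revenue", 180_000, source="two signed contracts at £90,000 each"),
    Figure("Professional fees, regulatory", 40_000),  # no traceable source
]
for f in untraceable(lines):
    print(f"Needs verification: {f.label} £{f.value:,.0f}")

# Comparable-firm surveys put ops spend for a ten-person software firm at £200k-£350k.
print(magnitude_outlier(420_000, 200_000, 350_000))  # True: an outlier worth examining
```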
When should owners spot-check, sample, or recompute from source?
Verification intensity should match the stakes, because no small firm has the capacity to recompute everything. Spot-check a single figure when it is high impact and sits inside your domain expertise. Sample-check three to five figures when the AI output is a multi-line document. Recompute from source data when the figure will drive a strategic decision or appear in a funding application.
Spot-checking works well when you know the territory. A founder who has run the business for a decade can usually tell whether a projected 15 per cent Q4 margin aligns with the cost structure and seasonality they have lived with. Sample-checking is the move for documents with many related figures: verifying three to five randomly selected lines against payroll, supplier invoices, or the bank statement is usually enough to surface systematic issues if any are present. Recomputing is reserved for the high-stakes work: anything you would be challenged on by a lender, an investor, a regulator, or a buyer in due diligence. A simple spreadsheet built from documented inputs is almost always more defensible than the AI output, and the time cost is repaid the moment the figure is scrutinised.
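A sample-check can be as mundane as pulling random rows from the exported forecast and checking each one by hand. A minimal sketch follows; the file name and column names are assumptions for illustration.

```python
import csv
import random

def sample_for_verification(path: str, k: int = 5) -> list:
    """Pick k random rows from an exported forecast for manual checking."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return random.sample(rows, min(k, len(rows)))

# Verify each sampled line against payroll, supplier invoices, or the bank statement.
for row in sample_for_verification("cashflow_forecast.csv", k=5):
    print(row["line_item"], row["month"], row["amount"])
```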
How should the team know the difference between generated and verified numbers?
By labelling. Every number in a planning document should carry a tag indicating its origin: “AI-generated from inputs as of [date]”, “Extracted from [system]”, “Calculated from [specified inputs]”, or “Verified by [person/date]”. The label travels with the figure through decks and prevents the silent merging of generated forecasts with extracted-from-contracts baselines. A board pack that conflates the two has lost its data lineage.
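If the planning numbers live in a spreadsheet export or a small script rather than only in a deck, the labelling rule can be made structural. The sketch below mirrors the four tags from the text; the class and field names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    AI_GENERATED = "AI-generated from inputs as of [date]"
    EXTRACTED = "Extracted from [system]"
    CALCULATED = "Calculated from [specified inputs]"
    VERIFIED = "Verified by [person/date]"

@dataclass
class TaggedFigure:
    label: str
    value: float
    provenance: Provenance
    detail: str  # the filled-in date, system, inputs, or person

fig = TaggedFigure("Month 3 revenue", 180_000, Provenance.EXTRACTED, "signed-contract register")
print(f"{fig.label}: £{fig.value:,.0f}  [{fig.provenance.name}: {fig.detail}]")
```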
The rule on which kind belongs where is straightforward. AI-generated figures are fine for brainstorming, scenario modelling, and identifying questions worth investigating further. They are out of place in funding applications, regulatory filings, board presentations to external parties, or any communication where inaccuracy carries material risk. The same figure can move through stages: an AI estimate becomes the starting point for a conversation, the conversation produces a documented calculation, and the documented calculation becomes the figure that reaches the stakeholder. The discipline is to treat AI output as a draft pending verification by default, not as a final number with the verification step quietly skipped. The Kyriba 2025 CFO survey found 76 per cent of finance leaders already concerned about AI accuracy and 61 per cent admitting to second-guessing their data monthly even without AI in the loop; the labelling rule formalises a judgement they are already making informally.
If you want a sounding board on where AI-generated numbers are already travelling through your own decision documents, book a conversation.



