How good is your AI measurement, really? The five levels

TL;DR

Most owner-managed businesses are between levels one and two on the AI measurement ladder, tracking licences or asking users retrospectively how much time the tool saves. Level three is the threshold worth crossing. A structured approach, runnable internally in twelve months without consultancy support, produces measurement defensible enough to survive a hard question from the CFO.

Key takeaways

- The AI measurement maturity ladder has five levels, from anecdotal impressions at the bottom to continuous portfolio discipline at the top.
- Around fifty to sixty per cent of owner-managed businesses that buy AI tools are at level one; a further quarter to a third are at level two, both below the threshold for defensible measurement.
- Level three is the rung where board scrutiny becomes feasible; reaching it requires roughly forty to sixty hours of internal work distributed over twelve months.
- Hours-saved figures from retrospective surveys tend to overstate the real position; a structured time-study before and after deployment produces a figure that holds up.
- Measured productivity gains often do not convert directly to financial improvement unless the business has a deliberate plan for where freed-up time will go.

You have been collecting wins. A process that used to eat a morning gets done before lunch. The team is positive. You have been telling the story upward and it has mostly landed.

Then, around month six or nine, a senior person asks the harder version of the question. They want to know what this has actually delivered, in numbers, with evidence they could hand to an auditor. And you realise that what you have been reporting is a story built on impressions rather than measurement.

That gap has a name. It is called measurement maturity. Knowing which level you are at tells you exactly what to fix first.

What is the AI measurement maturity ladder?

The AI measurement maturity ladder is a five-level framework describing how rigorously a business tracks the return on its AI spending. At the bottom, the evidence is impressions. At the top, firms run structured portfolio reviews with documented metrics, assigned ownership, and explicit go/no-go decisions. Many owner-managed businesses with AI in place sit somewhere between levels one and two, which means they have a number in mind but could not defend it under scrutiny.

Level one is anecdotal. The firm uses AI tools and believes they are probably working, but has no baseline, no time-study, and no financial tracking. The evidence is what individuals say. Around half of owner-managed businesses that purchase AI are still at this level six months in.

Level two is hours-saved tracked, but unreliably. The firm has a number, typically gathered by asking users retrospectively how much time the tool saves each week. The figure exists and circulates internally, but it has not been validated through a structured time-study, and quality tradeoffs have not been examined. Around a quarter to a third of AI-using businesses sit here.

Level three is defensible methodology. The firm has conducted a structured time-study before and after deployment, assessed output quality against a defined rubric, and documented the method clearly enough to present to an external reviewer. Roughly one in ten businesses reaches this level. It is the threshold above which CFO and board scrutiny becomes feasible.

Level four is decision-grade ROI. Measurement is embedded in the business rhythm, with quarterly tracking and a formal annual review that produces a documented go/no-go decision. Fewer than five in a hundred businesses operate here.

Level five is continuous portfolio discipline, where the same rigour extends across every technology investment the firm makes. Fewer than one in a hundred manage it.
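For anyone who wants to place their own firm on the ladder, the five descriptions above can be sketched as a simple checklist. This is a minimal illustration, not a standard instrument: the `MeasurementEvidence` fields and the `ladder_level` function are hypothetical names, and the sketch assumes the rungs are cumulative.

```python
from dataclasses import dataclass


@dataclass
class MeasurementEvidence:
    """Hypothetical checklist; field names are illustrative, not from a standard."""
    has_hours_saved_figure: bool = False           # any figure, e.g. from a retrospective survey
    has_before_after_time_study: bool = False      # structured study before and after deployment
    has_quality_rubric: bool = False               # output quality assessed against a defined rubric
    method_documented: bool = False                # clear enough for an external reviewer
    has_quarterly_and_annual_review: bool = False  # annual review ends in a go/no-go decision
    covers_all_tech_investments: bool = False      # same rigour across the whole portfolio


def ladder_level(e: MeasurementEvidence) -> int:
    """Return the highest rung (1 to 5) whose requirements are all met.

    Assumes the rungs are cumulative: a firm cannot be at level four
    without also meeting everything level three asks for.
    """
    if not e.has_hours_saved_figure:
        return 1
    defensible = (e.has_before_after_time_study
                  and e.has_quality_rubric
                  and e.method_documented)
    if not defensible:
        return 2
    if not e.has_quarterly_and_annual_review:
        return 3
    if not e.covers_all_tech_investments:
        return 4
    return 5


# A firm with a survey-based number and nothing else sits at level two.
survey_only = MeasurementEvidence(has_hours_saved_figure=True)
print(ladder_level(survey_only))  # 2
```

The point of writing it this way is that a survey figure alone never gets a firm past level two, however confident the number sounds.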

Why does your measurement level matter for the business?

Your measurement level determines whether your AI spend can survive internal scrutiny. A level-one or level-two answer to the question of what the investment has delivered relies on impressions or unvalidated figures. A level-three answer is grounded in a defined method, with documented assumptions and an honest estimate of error. Those are different conversations with a CFO or a board, and they produce different decisions.

The gap matters beyond board conversations. Without reliable numbers, businesses tend to continue tools they should have stopped and under-resource ones that are genuinely working. Research on technology adoption suggests that organisations operating at level three or above achieve around twenty to thirty per cent better returns on their technology investments than those at level one, largely because they identify problems early enough to act on them rather than discovering the truth at renewal.

There is a specific problem with operating at level two. The hours-saved figure produced by retrospective survey tends to overstate the real position. Studies of human time estimation consistently show that people are unreliable judges of their own time allocation, particularly on cognitively demanding work. When users know a tool is being evaluated, their estimates pull in competing directions. A number produced this way may feel reassuring in a team meeting, but it will not hold up under direct examination.

Where do most owner-managed businesses actually sit on the ladder?

Around fifty to sixty per cent of owner-managed businesses that have purchased AI tools are at level one. A further quarter to a third are at level two, tracking some version of hours-saved without time-study validation. That puts between three quarters and nine in ten businesses below the threshold where their measurement would survive a hard question. This is the base case, not the exception.

The reason is not indifference. Owner-managed businesses rarely have someone whose job includes programme evaluation or financial attribution. The operations director running the AI rollout lacks the time and often the training to implement a rigorous framework. The finance manager is aware the spend was significant but has not been given the tools to assess it. Responsibility is diffuse and no one person feels they own the measurement question.

Sector research reinforces this pattern. The ICAEW’s technology benchmarks for accountancy practices and the Law Society’s annual technology surveys both show that measurement failure in professional services is the default outcome when implementation is not accompanied by a deliberate measurement plan. Many firms only confront the gap when a renewal is approaching or when a senior stakeholder asks the question that the current evidence cannot answer.

When is the jump from level 2 to level 3 worth making?

The jump from level 2 to level 3 is worth making when the AI spend is large enough to justify scrutiny, when a renewal decision is approaching, or when you plan to extend AI further across the business. At that point, you need numbers that can inform a decision rather than support a prior view. Research suggests reaching level 3 requires roughly forty to sixty hours of internal work across twelve months.

Those hours break down into four activities: establishing a baseline of current performance before the AI is deployed, running a structured time-study after deployment on a defined sample of work, conducting a blinded quality assessment against a clear rubric, and documenting the method and assumptions clearly. In total, that is roughly one to one and a half working weeks of effort, spread across a year.
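The core arithmetic of a before-and-after time-study is small enough to sketch. The example below is an illustration under stated assumptions: the function name `time_study_summary` is hypothetical, the sample timings are invented for the example, and a real study would also need a defined sampling rule and comparable tasks in both samples.

```python
from statistics import mean


def time_study_summary(baseline_minutes, post_minutes):
    """Compare mean task time before and after deployment.

    Both arguments are lists of observed minutes per task, taken from
    a defined sample of comparable work.
    """
    base = mean(baseline_minutes)
    post = mean(post_minutes)
    saved = base - post
    return {
        "baseline_mean": base,
        "post_mean": post,
        "minutes_saved_per_task": saved,
        "percent_saved": 100 * saved / base,
    }


# Invented sample data: four timed tasks before deployment, four after.
baseline_sample = [120, 135, 110, 125]
post_sample = [80, 95, 70, 75]

summary = time_study_summary(baseline_sample, post_sample)
print(summary)
```

Recording the raw observations, rather than only the summary, is what lets an external reviewer check the method later.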

The investment is proportionate to what is at stake. A typical AI tool spend in professional services runs to between twenty thousand and fifty thousand pounds per year once software, implementation, and management time are included. The cost of good measurement against that spend is modest, and the payback is the ability to make a rational decision at renewal rather than defaulting to sunk-cost reasoning.

The jump from level three to level four is a different kind of challenge. It requires embedding measurement into the normal operating rhythm, which is a leadership commitment more than a technical task. But the jump from level two to level three is the one with the clearest near-term payback, and it is the right first move for anyone who wants their numbers to hold up.

What else shapes whether your AI numbers hold up?

Two things undercut measurement quality even for businesses that have tried to do it properly. The first is running an hours-saved calculation without accounting for output quality. The second is treating measured productivity gains as equivalent to financial improvement, when the two are often not the same thing. Getting both right is what separates a number that can be defended from one that falls apart under examination.

On quality: if the AI method produces work that is faster but lower quality, requiring more review or correction downstream, the net time saving is smaller than the headline figure. Research from the Stanford Digital Economy Lab found that productivity gains from AI are right-skewed across workers and tasks. A measurement that tracks speed without tracking quality captures an incomplete picture, and the incomplete part is often where the problems are.
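The quality adjustment described above is simple subtraction, sketched here with invented figures. The function name `net_minutes_saved` and the numbers are assumptions for illustration; the extra review time would in practice come from the quality assessment.

```python
def net_minutes_saved(baseline_minutes, ai_minutes, extra_review_minutes):
    """Return (headline, net) minutes saved per task.

    headline: the raw speed gain from using the AI method.
    net: the gain left after extra review or correction downstream.
    """
    headline = baseline_minutes - ai_minutes
    net = headline - extra_review_minutes
    return headline, net


# Invented example: a 120-minute task now takes 80 minutes,
# but needs 15 extra minutes of review.
headline, net = net_minutes_saved(120, 80, 15)
print(f"headline saving: {headline} min, net saving: {net} min")
```

In this example the headline figure is 40 minutes but the defensible figure is 25, which is the gap a speed-only measurement would miss.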

On financial conversion: measured productivity gains often do not convert directly to margin improvement. Research on technology adoption patterns suggests that roughly forty to fifty per cent of measured time savings become actual cost reduction, with the remainder absorbed into work expansion, quality improvement, or spare capacity. If your measurement framework captures hours-saved but not where those hours went, you cannot tell the CFO whether the investment improved the business or simply changed how the team’s time is allocated.
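One way to make the conversion question concrete is to record where the saved hours went and value only the portion that became cost reduction. The sketch below is illustrative: the function name, the category labels, the hourly cost, and the hour counts are all assumptions, chosen so the conversion rate falls inside the forty to fifty per cent range mentioned above.

```python
def value_of_saved_hours(hours_by_destination, hourly_cost):
    """Split saved hours by where they went and value the converted part.

    hours_by_destination: dict mapping a destination label to hours.
    Only hours under "cost_reduction" are counted as financial gain.
    """
    total = sum(hours_by_destination.values())
    converted = hours_by_destination.get("cost_reduction", 0.0)
    rate = converted / total if total else 0.0
    return {
        "total_hours": total,
        "conversion_rate": rate,
        "cash_value": converted * hourly_cost,
    }


# Invented annual figures for a single team.
saved_hours = {
    "cost_reduction": 450,
    "work_expansion": 300,
    "quality_improvement": 150,
    "spare_capacity": 100,
}

result = value_of_saved_hours(saved_hours, hourly_cost=40.0)
print(result)
```

The useful output for a CFO is the split itself: the hours that did not convert are not necessarily wasted, but they need a different justification from the ones that did.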

Both problems are manageable with a level-three approach. The minimum viable framework includes a structured time-study, a blinded quality assessment on a sample of work, and an explicit account of where freed-up time has been deployed. That combination produces a number worth taking into a board room.

Sources

- Brynjolfsson, E., Li, D. & Raymond, L.R. (2023). Generative AI at work. NBER Working Paper 31161. Stanford Digital Economy Lab research demonstrating that AI productivity gains are right-skewed, with average effects that vary substantially by worker type and task suitability. https://www.nber.org/papers/w31161
- McKinsey & Company (2023). The economic potential of generative AI: the next productivity frontier. Analysis of value leakage patterns showing that roughly 40-50% of measured AI productivity gains convert to cost reduction, with the remainder absorbed into work expansion or slack. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai
- ICAEW (2024). Technology in practice: digital skills and capabilities survey. Institute of Chartered Accountants in England and Wales data on AI adoption patterns in UK accountancy practices, identifying measurement failure as the default rather than the exception. https://www.icaew.com/technical/technology/technology-and-finance
- Law Society (2025). Technology and innovation in the legal sector. Annual survey of AI adoption, measurement discipline, and value realisation in UK legal practices. https://www.lawsociety.org.uk/topics/research
- Standish Group (2024). CHAOS Report. Cross-industry analysis of technology project outcomes showing that approximately 35% of projects deliver expected benefits, with a median ROI shortfall of around 30% against projections. https://www.standishgroup.com
- IBM Institute for Business Value (2024). CEO's guide to generative AI. Research on measurement capability gaps in mid-market organisations and their impact on realised returns from AI deployments. https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/ceo-generative-ai
- SEI/Carnegie Mellon University (2010). Capability Maturity Model Integration (CMMI). Foundational framework for technology capability maturity levels, on which the AI measurement maturity ladder is modelled. https://cmmi.sei.cmu.edu
- Gartner (2025). Augmented-decisioning framework for AI value measurement. Analysis of how to distinguish automation, augmentation, and informational AI use cases for appropriate ROI tracking methodology. https://www.gartner.com/en/information-technology/insights/artificial-intelligence

Frequently asked questions

What level of AI measurement do most owner-managed businesses operate at?

Around fifty to sixty per cent of owner-managed businesses that purchase AI tools are at level one, relying on impressions rather than measured data. A further quarter to a third are at level two, tracking some version of hours-saved through retrospective survey. That puts the large majority below the threshold where their numbers could survive scrutiny from a board or CFO.

How much internal time does it take to reach level 3 measurement?

Moving from anecdotal or unvalidated tracking to a defensible methodology requires roughly forty to sixty hours of internal work across twelve months. That involves establishing a baseline before deployment, running a structured time-study afterwards, conducting a blinded quality assessment on a sample of work, and documenting the method clearly enough that someone external could review it.

Why doesn't measuring hours-saved on its own tell the whole story?

Two gaps undercut an hours-saved figure in isolation. First, if the AI method produces work of lower quality requiring more review downstream, the net time saving is smaller than the headline number. Second, even accurate time savings do not automatically convert to margin improvement if freed-up time is absorbed into work expansion or spare capacity rather than deployed deliberately.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.
