Why the AI hours-saved survey is not measurement

A finance manager at a desk with a clipboard time-log and laptop in late afternoon light
TL;DR

The retrospective survey (“on average, how many hours per week does the AI save you?”) is everywhere in SME AI ROI conversations and is essentially indefensible to a CFO. Time-study or activity-log measurement, with five to ten people across two weeks, gives an hours-saved figure with plus or minus 20 to 30 percent error bars.

Key takeaways

- The retrospective survey is biased by overstatement, understatement, misremembering, and confusion of hours-saved with hours-redirected.
- Before-and-after time-study with five to ten people across two weeks gives a defensible figure with plus or minus 20 to 30 percent error bars.
- Activity logs capture roughly 70 to 80 percent of full time-study accuracy at a fraction of the cost.
- Quality and rework time must be captured, otherwise hours-saved is overstated by 30 to 50 percent.
- The Brynjolfsson, Li, and Raymond 2023 customer service AI paper found 35 to 40 percent handle-time reduction for new workers, much smaller for experienced workers on complex cases.

Picture a director I’ll call Richard. Mid-tier firm, thirty fee-earners, twelve months into a Copilot rollout. Last quarter he stood in the boardroom and reported, “the team says it saves about ten hours a week on average.” The chair nodded. The CFO did not write anything down. Richard noticed the silence and only later worked out what it meant. The number had been reported confidently. It had not been validated. It was the only ROI input the board paper had.

The retrospective survey, “on average, how many hours per week does the AI save you?”, is everywhere in SME AI ROI conversations. It is also basically indefensible to a CFO. Two better methods exist, both feasible at SME scale.

Why is the retrospective survey biased?

Asking users to estimate their own hours-saved relies on memory and self-report, both of which are unreliable. Multiple biases run through the answer. Users have an incentive to overstate when they feel ownership of the rollout. They have an incentive to understate when they fear redundancies. They simply misremember. They sometimes report hours-redirected as hours-saved without noticing the difference.

Studies on human time estimation consistently show people are poor at quantifying their own time allocation, particularly on tasks where attention is divided or the work is cognitively demanding. The retrospective survey can sit alongside other methods as a sanity check; it should not be the primary basis for a financial claim.

The retrospective survey persists because the alternative looks heavy. Most firms see “time-study” and assume they are being asked to mount a research project they do not have the capacity for. They are not.

What does proper time-study look like?

The before-and-after time-study method is the right way and it is achievable for any SME with a finance manager and a few weeks of patience. The structure is simple. First, establish a baseline by timing how long the task takes with the existing method. Do this over a representative sample period, at least two weeks, ideally four. Time multiple people. Sample work randomly, because it is tempting to time only the tasks that happen to be going quickly.

Record time to the nearest five or fifteen minutes. High-precision timing for knowledge work is usually false precision; the cognitive work does not happen in clean blocks. The end result is a baseline time per task or per hour of work.
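As a sketch of the arithmetic, the rounding and baseline step might look like this in Python. The timings and the `round_to_increment` helper are illustrative, not a prescribed tool:

```python
import statistics

def round_to_increment(minutes: float, increment: int = 15) -> int:
    """Round a raw timing to the nearest 5- or 15-minute block,
    matching the precision the log actually supports."""
    return int(round(minutes / increment) * increment)

# Hypothetical baseline sample: raw per-document review times in minutes,
# collected from several people over a two-week window.
raw_times = [52, 61, 47, 75, 58, 66, 49, 71, 55, 63]

rounded = [round_to_increment(t, 15) for t in raw_times]
baseline_mean = statistics.mean(rounded)
print(f"Baseline: {baseline_mean:.1f} min per document (n={len(rounded)})")
```

The deliberate loss of precision is the point: a figure rounded to the block size the measurement can actually support is harder to attack than one quoted to the minute.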

Second, after the AI has been in use for four to six weeks, conduct the same time-study with the AI in place. Same tasks. Same sample size. Same sample period. The difference is the hours-saved figure.

The important caveat is to measure the same output quality. If the AI method produces work that requires rework or quality improvement, the rework time has to be captured. A measurement framework that reports “the AI saves 30 percent of time per document” without noting that it introduces errors requiring rework is marketing, not measurement.
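That caveat is easy to encode. A minimal sketch, with hypothetical numbers, of how counting rework against the AI-assisted method changes the headline saving:

```python
def net_saving_pct(baseline_min: float, ai_min: float, rework_min: float) -> float:
    """Net time saving as a fraction of baseline, charging rework
    time to the AI-assisted method. Inputs are mean minutes per task."""
    true_ai_cost = ai_min + rework_min
    return (baseline_min - true_ai_cost) / baseline_min

# Hypothetical figures: the AI drafts in 40 min against a 60-min baseline,
# but each draft needs 8 min of correction on average.
gross = net_saving_pct(60, 40, 0)   # 0.333 -> the marketing number
net = net_saving_pct(60, 40, 8)     # 0.200 -> the defensible number
print(f"gross {gross:.0%}, net {net:.0%}")
```

A third of the apparent saving disappears once correction time is charged where it belongs, which is the 30 to 50 percent overstatement the takeaways warn about.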

What about activity logs?

The activity-log method is a middle ground. It is more feasible than full time-study for many SMEs and substantially more reliable than retrospective survey. Users maintain logs of time spent on specific tasks, categorised by whether they used AI or did the work manually. The log is structured. Users record specific activities and time, with boxes to tick for AI usage. Real-time logging beats retrospective recall.

Research on worker productivity measurement suggests activity logs capture roughly 70 to 80 percent of the accuracy of full time-study, at a fraction of the cost. They have limits. Compliance deteriorates over time. Users who know they are being tracked work differently (the Hawthorne effect). For most SMEs, activity logs strike the right balance between rigour and effort.

The practical version is a one-page log, real-time entry, two weeks before deployment and two weeks after, across five to ten people. The log has limits, but it is defensible: the kind of artefact a CFO can interrogate.
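Assuming the one-page log ends up as something CSV-shaped (the column names here are illustrative, not a prescribed schema), tallying it is a few lines of Python:

```python
import csv
import io
from collections import defaultdict

# A minimal version of the one-page log: date, task, minutes, AI used (y/n).
log_csv = """date,task,minutes,used_ai
2025-03-03,doc_review,45,y
2025-03-03,doc_review,60,n
2025-03-04,doc_review,30,y
2025-03-04,drafting,90,n
"""

# Total minutes per task, split by whether AI was used.
totals = defaultdict(lambda: {"ai": 0, "manual": 0})
for row in csv.DictReader(io.StringIO(log_csv)):
    bucket = "ai" if row["used_ai"] == "y" else "manual"
    totals[row["task"]][bucket] += int(row["minutes"])

for task, split in totals.items():
    print(task, split)
```

The tick-box for AI usage is what makes the log analysable at all: without it, the before-and-after comparison collapses back into guesswork.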

What sample size and duration give a defensible number?

For a firm with fifty to a hundred staff, a baseline and follow-up time-study of two weeks per measurement, with five to ten people sampled, is enough to arrive at a defensible hours-saved figure. The firm can confidently say: “We measured the time required to complete document review over two weeks before deployment and two weeks after. We sampled X documents and Y professionals. The mean time reduction was Z hours per document.”

With five individuals and ten tasks, the confidence interval on the mean is approximately plus or minus 20 to 30 percent. That margin is honest and defensible, the kind of figure a CFO can interrogate without it dissolving in their hands.
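To illustrate where the plus or minus 20 to 30 percent comes from, here is a back-of-envelope confidence interval on ten hypothetical per-task savings, using only the standard library (the t critical value for nine degrees of freedom is taken from standard tables):

```python
import math
import statistics

# Hypothetical per-task savings in minutes: five people, ten sampled tasks.
savings = [18, 25, 12, 30, 22, 15, 27, 20, 10, 24]

n = len(savings)
mean = statistics.mean(savings)
sem = statistics.stdev(savings) / math.sqrt(n)  # standard error of the mean
t_crit = 2.262                                  # t(0.975, df = 9), from tables
half_width = t_crit * sem                       # 95% CI half-width

print(f"mean saving {mean:.1f} min, 95% CI ±{half_width:.1f} min "
      f"({half_width / mean:.0%} of the mean)")
```

With this spread of savings the half-width lands at roughly 23 percent of the mean, squarely inside the 20 to 30 percent band quoted above; a noisier sample pushes it towards the top of that band.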

For a smaller firm of ten to twenty professionals, a longer measurement period of four weeks becomes more important: each individual’s productivity pattern has more impact on the aggregate when the sample is small.

The point of explicit sampling and duration is the ability to answer a CFO who asks “how did you measure that?” with something specific. “We sampled five lawyers across two weeks before rollout and five lawyers across two weeks after, on first-pass document relevance review” is an answer. “The team says about ten hours” is not.

What does the research actually find?

The Brynjolfsson, Li, and Raymond 2023 paper on AI customer service, refined in 2024-2025 Stanford Digital Economy Lab follow-ups, looked at customer service interactions with AI assistance available. The finding: for new workers on routine issues, AI reduced average handle time by roughly 35 to 40 percent. For experienced workers on complex cases, the benefit was much smaller, sometimes negative; those workers often worked around the AI because it was not helpful for their cases.

The broader pattern from the Stanford work is that AI productivity effects are heavily right-skewed. A minority of workers see substantial productivity gains, fifty to one hundred percent. A larger group sees modest gains, ten to twenty percent. Some see negligible or negative effects. The determinants of who benefits include prior task familiarity, workflow integration, and manager skill at assigning AI to suitable work.

The honest takeaway is that hours-saved in the range of twenty to fifty percent is achievable for suitable tasks and workers, with a mean closer to thirty percent. That figure is not universal. Generalising one team’s benefit to another is a common source of overestimation, which is exactly what the retrospective survey method tends to do.

If you want to know your firm’s actual hours-saved figure rather than your team’s recollection of it, the methodology is straightforward but deliberate. If you’d like to talk through how to set up a defensible time-study or activity-log measurement for your AI deployment, book a conversation.

Sources

  • Brynjolfsson, Li, and Raymond (2023): AI customer-service productivity paper, finding 35-40% handle-time reduction for new workers on routine issues, much smaller or negative effects for experienced workers on complex cases.
  • Stanford Digital Economy Lab (2024-2025) follow-up research: AI productivity effects heavily right-skewed, mean closer to 30% across suitable tasks, with substantial dispersion driven by task familiarity, workflow integration, and manager skill at AI assignment.
  • Worker productivity measurement research: activity logs capture roughly 70-80% of full time-study accuracy at a fraction of the cost; subject to gaming and compliance drop-off but substantially more reliable than retrospective recall.
  • McKinsey & Company (2025). The State of AI Global Survey. 88 per cent of organisations now use AI in at least one function but only 39 per cent report enterprise-level EBIT impact, the measurement gap that maturity frameworks address.
  • McKinsey & Company (2024). From Promise to Impact: How Companies Can Measure and Realise the Full Value of AI. Five-layer measurement framework spanning technical performance, adoption, operational KPIs, strategic outcomes, financial impact.
  • MIT CISR (Woerner, Sebastian, Weill and Kaganer, 2025). Grow Enterprise AI Maturity for Bottom-Line Impact. Stage 3 enterprises achieve growth 11.3 percentage points and profit 8.7 percentage points above industry average; Stage 1 firms underperform on both.
  • Boston Consulting Group (2025). Are You Generating Value from AI? The Widening Gap. Five per cent of future-built firms achieve five times the revenue gains and three times the cost reductions of peers, with 60 per cent reporting almost no material value from AI investment.
  • Standish Group, CHAOS Report (2020). Long-running benchmark of IT-project outcomes. 31 per cent succeed on contemporary definitions, 50 per cent are challenged, 19 per cent fail outright, the historical baseline for technology-investment measurement maturity.

Frequently asked questions

Why is asking my team how many hours AI saves them not enough?

The retrospective survey is biased in four ways. Users overstate if they feel ownership, understate if they fear redundancies, simply misremember, and often report hours-redirected as hours-saved. The number can sit alongside other methods as a sanity check, but it should not be the primary basis for a financial claim to a CFO.

What is a defensible hours-saved methodology for an SME?

Before-and-after time-study with five to ten people across two weeks per measurement, recording time to the nearest five to fifteen minutes. The same task, the same sample size, the same sample period, before and after AI deployment. Quality and rework time must be captured to avoid overstating the saving.

How do activity logs compare to time-study?

Activity logs capture roughly 70 to 80 percent of time-study accuracy at a fraction of the effort. Users record specific activities and time in real-time, with boxes to tick for AI usage. They are subject to gaming and compliance drop-off but substantially more reliable than retrospective recall.

What does the research actually find on AI hours saved?

Brynjolfsson, Li, and Raymond 2023 on AI customer service, refined in Stanford Digital Economy Lab 2024-2025 follow-ups, found 35 to 40 percent handle-time reduction for new workers on routine issues, but much smaller or negative effects for experienced workers on complex cases. The mean across suitable tasks is closer to 30 percent. The number is highly task-specific and worker-specific.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
