Why the AI hours-saved survey is not measurement

A finance manager at a desk with a clipboard time-log and laptop in late afternoon light
TL;DR

The retrospective survey (“on average, how many hours per week does the AI save you?”) is everywhere in SME AI ROI conversations and is essentially indefensible to a CFO. Time-study or activity-log measurement, with five to ten people across two weeks, gives an hours-saved figure with plus or minus 20 to 30 percent error bars.

Key takeaways

- The retrospective survey is biased by overstatement, understatement, misremembering, and confusion of hours-saved with hours-redirected.
- Before-and-after time-study with five to ten people across two weeks gives a defensible figure with plus or minus 20 to 30 percent error bars.
- Activity logs capture roughly 70 to 80 percent of full time-study accuracy at a fraction of the cost.
- Quality and rework time must be captured, otherwise hours-saved is overstated by 30 to 50 percent.
- The Brynjolfsson, Li, and Raymond 2023 customer service AI paper found 35 to 40 percent handle-time reduction for new workers, much smaller for experienced workers on complex cases.

Picture a director I’ll call Richard. Mid-tier firm, thirty fee-earners, twelve months into a Copilot rollout. Last quarter he stood in the boardroom and reported, “the team says it saves about ten hours a week on average.” The chair nodded. The CFO did not write anything down. Richard noticed the silence and only later worked out what it meant. The number had been reported confidently. It had not been validated. It was the only ROI input the board paper had.

The retrospective survey, “on average, how many hours per week does the AI save you?”, is everywhere in SME AI ROI conversations. It is also basically indefensible to a CFO. Two better methods exist, both feasible at SME scale.

Why is the retrospective survey biased?

Asking users to estimate their own hours-saved relies on memory and self-report, both of which are unreliable. Multiple biases run through the answer. Users have an incentive to overstate when they feel ownership of the rollout. They have an incentive to understate when they fear redundancies. They simply misremember. They sometimes report hours-redirected as hours-saved without noticing the difference.

Studies on human time estimation consistently show people are poor at quantifying their own time allocation, particularly on tasks where attention is divided or the work is cognitively demanding. The retrospective survey can sit alongside other methods as a sanity check; it should not be the primary basis for a financial claim.

The retrospective survey persists because the alternative looks heavy. Most firms see “time-study” and assume they are being asked to mount a research project they do not have the capacity for. They are not.

What does proper time-study look like?

The before-and-after time-study method is the right way and it is achievable for any SME with a finance manager and a few weeks of patience. The structure is simple. First, establish a baseline by timing how long the task takes with the existing method. Do this over a representative sample period, at least two weeks, ideally four. Time multiple people. Sample work randomly, because it is tempting to time only the tasks that happen to be going quickly.

Record time to the nearest five or fifteen minutes. High-precision timing for knowledge work is usually false precision; the cognitive work does not happen in clean blocks. The end result is a baseline time per task or per hour of work.
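As a sketch of the arithmetic, the rounding and baseline step might look like this in Python. The timings and the `round_to_increment` helper are illustrative, not a prescribed tool:

```python
import statistics

def round_to_increment(minutes: float, increment: int = 15) -> int:
    """Round a raw timing to the nearest 5- or 15-minute block,
    matching the precision the log actually supports."""
    return int(round(minutes / increment) * increment)

# Hypothetical baseline sample: raw per-document review times in minutes,
# collected from several people over a two-week window.
raw_times = [52, 61, 47, 75, 58, 66, 49, 71, 55, 63]

rounded = [round_to_increment(t, 15) for t in raw_times]
baseline_mean = statistics.mean(rounded)
print(f"Baseline: {baseline_mean:.1f} min per document (n={len(rounded)})")
```

The deliberate loss of precision is the point: a figure rounded to the block size the measurement can actually support is harder to attack than one quoted to the minute.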

Second, after the AI has been in use for four to six weeks, conduct the same time-study with the AI in place. Same tasks. Same sample size. Same sample period. The difference is the hours-saved figure.

The important caveat is to measure the same output quality. If the AI method produces work that requires rework or quality improvement, the rework time has to be captured. A measurement framework that reports “the AI saves 30 percent of time per document” without noting that it introduces errors requiring rework is marketing, not measurement.
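That caveat is easy to encode. A minimal sketch, with hypothetical numbers, of how counting rework against the AI-assisted method changes the headline saving:

```python
def net_saving_pct(baseline_min: float, ai_min: float, rework_min: float) -> float:
    """Net time saving as a fraction of baseline, charging rework
    time to the AI-assisted method. Inputs are mean minutes per task."""
    true_ai_cost = ai_min + rework_min
    return (baseline_min - true_ai_cost) / baseline_min

# Hypothetical figures: the AI drafts in 40 min against a 60-min baseline,
# but each draft needs 8 min of correction on average.
gross = net_saving_pct(60, 40, 0)   # 0.333 -> the marketing number
net = net_saving_pct(60, 40, 8)     # 0.200 -> the defensible number
print(f"gross {gross:.0%}, net {net:.0%}")
```

A third of the apparent saving disappears once correction time is charged where it belongs, which is the 30 to 50 percent overstatement the takeaways warn about.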

What about activity logs?

The activity-log method is a middle ground. It is more feasible than full time-study for many SMEs and substantially more reliable than retrospective survey. Users maintain logs of time spent on specific tasks, categorised by whether they used AI or did the work manually. The log is structured. Users record specific activities and time, with boxes to tick for AI usage. Real-time logging beats retrospective recall.

Research on worker productivity measurement suggests activity logs capture roughly 70 to 80 percent of the accuracy of full time-study, at a fraction of the cost. They have limits. Compliance deteriorates over time. Users who know they are being tracked work differently (the Hawthorne effect). For most SMEs, activity logs strike the right balance between rigour and effort.

The practical version is a one-page log, real-time entry, two weeks before deployment and two weeks after, across five to ten people. The log has limits, but it is defensible: the kind of artefact a CFO can interrogate.
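Assuming the one-page log ends up as something CSV-shaped (the column names here are illustrative, not a prescribed schema), tallying it is a few lines of Python:

```python
import csv
import io
from collections import defaultdict

# A minimal version of the one-page log: date, task, minutes, AI used (y/n).
log_csv = """date,task,minutes,used_ai
2025-03-03,doc_review,45,y
2025-03-03,doc_review,60,n
2025-03-04,doc_review,30,y
2025-03-04,drafting,90,n
"""

# Total minutes per task, split by whether AI was used.
totals = defaultdict(lambda: {"ai": 0, "manual": 0})
for row in csv.DictReader(io.StringIO(log_csv)):
    bucket = "ai" if row["used_ai"] == "y" else "manual"
    totals[row["task"]][bucket] += int(row["minutes"])

for task, split in totals.items():
    print(task, split)
```

The tick-box for AI usage is what makes the log analysable at all: without it, the before-and-after comparison collapses back into guesswork.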

What sample size and duration give a defensible number?

For a firm with fifty to a hundred staff, a baseline and follow-up time-study of two weeks per measurement, with five to ten people sampled, is enough to arrive at a defensible hours-saved figure. The firm can confidently say: “We measured the time required to complete document review over two weeks before deployment and two weeks after. We sampled X documents and Y professionals. The mean time reduction was Z hours per document.”

With five individuals and ten tasks, the confidence interval on the mean is approximately plus or minus 20 to 30 percent. That margin is honest and defensible, the kind of figure a CFO can interrogate without it dissolving in their hands.
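To illustrate where the plus or minus 20 to 30 percent comes from, here is a back-of-envelope confidence interval on ten hypothetical per-task savings, using only the standard library (the t critical value for nine degrees of freedom is taken from standard tables):

```python
import math
import statistics

# Hypothetical per-task savings in minutes: five people, ten sampled tasks.
savings = [18, 25, 12, 30, 22, 15, 27, 20, 10, 24]

n = len(savings)
mean = statistics.mean(savings)
sem = statistics.stdev(savings) / math.sqrt(n)  # standard error of the mean
t_crit = 2.262                                  # t(0.975, df = 9), from tables
half_width = t_crit * sem                       # 95% CI half-width

print(f"mean saving {mean:.1f} min, 95% CI ±{half_width:.1f} min "
      f"({half_width / mean:.0%} of the mean)")
```

With this spread of savings the half-width lands at roughly 23 percent of the mean, squarely inside the 20 to 30 percent band quoted above; a noisier sample pushes it towards the top of that band.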

For a smaller firm of ten to twenty professionals, a longer measurement period of four weeks becomes more important: each individual’s productivity pattern has more impact on the aggregate when the sample is small.

The point of explicit sampling and duration is the ability to answer a CFO who asks “how did you measure that?” with something specific. “We sampled five lawyers across two weeks before rollout and five lawyers across two weeks after, on first-pass document relevance review” is an answer. “The team says about ten hours” is not.

What does the research actually find?

The Brynjolfsson, Li, and Raymond 2023 paper on AI customer service, refined in 2024-2025 Stanford Digital Economy Lab follow-ups, looked at customer service interactions with AI assistance available. The finding: for new workers on routine issues, AI reduced average handle time by roughly 35 to 40 percent. For experienced workers on complex cases, the benefit was much smaller, sometimes negative; those workers often worked around the AI because it was not helpful for their cases.

The broader pattern from the Stanford work is that AI productivity effects are heavily right-skewed. A minority of workers see substantial productivity gains, fifty to one hundred percent. A larger group sees modest gains, ten to twenty percent. Some see negligible or negative effects. The determinants of who benefits include prior task familiarity, workflow integration, and manager skill at assigning AI to suitable work.

The honest takeaway is that hours-saved in the range of twenty to fifty percent is achievable for suitable tasks and workers, with a mean closer to thirty percent. That figure is not universal. Generalising one team’s benefit to another is a common source of overestimation, which is exactly what the retrospective survey method tends to do.

If you want to know your firm’s actual hours-saved figure rather than your team’s recollection of it, the methodology is straightforward but deliberate. If you’d like to talk through how to set up a defensible time-study or activity-log measurement for your AI deployment, book a conversation.

Sources

  • Brynjolfsson, Li, and Raymond (2023): AI customer-service productivity paper, finding 35-40% handle-time reduction for new workers on routine issues, much smaller or negative effects for experienced workers on complex cases.
  • Stanford Digital Economy Lab (2024-2025) follow-up research: AI productivity effects heavily right-skewed, mean closer to 30% across suitable tasks, with substantial dispersion driven by task familiarity, workflow integration, and manager skill at AI assignment.
  • Worker productivity measurement research: activity logs capture roughly 70-80% of full time-study accuracy at a fraction of the cost; subject to gaming and compliance drop-off but substantially more reliable than retrospective recall.
  • McKinsey & Company (2025). The State of AI Global Survey. 88 per cent of organisations now use AI in at least one function but only 39 per cent report enterprise-level EBIT impact, the measurement gap that maturity frameworks address.
  • McKinsey & Company (2024). From Promise to Impact: How Companies Can Measure and Realise the Full Value of AI. Five-layer measurement framework spanning technical performance, adoption, operational KPIs, strategic outcomes, financial impact.
  • MIT CISR (Woerner, Sebastian, Weill and Kaganer, 2025). Grow Enterprise AI Maturity for Bottom-Line Impact. Stage 3 enterprises achieve growth 11.3 percentage points and profit 8.7 percentage points above industry average; Stage 1 firms underperform on both.
  • Boston Consulting Group (2025). Are You Generating Value from AI? The Widening Gap. Five per cent of future-built firms achieve five times the revenue gains and three times the cost reductions of peers, with 60 per cent reporting almost no material value from AI investment.
  • Standish Group, CHAOS Report (2020). Long-running benchmark of IT-project outcomes. 31 per cent succeed on contemporary definitions, 50 per cent are challenged, 19 per cent fail outright, the historical baseline for technology-investment measurement maturity.

Frequently asked questions

Why is asking my team how many hours AI saves them not enough?

The retrospective survey is biased in four ways. Users overstate if they feel ownership, understate if they fear redundancies, simply misremember, and often report hours-redirected as hours-saved. The number can sit alongside other methods as a sanity check, but it should not be the primary basis for a financial claim to a CFO.

What is a defensible hours-saved methodology for an SME?

Before-and-after time-study with five to ten people across two weeks per measurement, recording time to the nearest five to fifteen minutes. The same task, the same sample size, the same sample period, before and after AI deployment. Quality and rework time must be captured to avoid overstating the saving.

How do activity logs compare to time-study?

Activity logs capture roughly 70 to 80 percent of time-study accuracy at a fraction of the effort. Users record specific activities and time in real-time, with boxes to tick for AI usage. They are subject to gaming and compliance drop-off but substantially more reliable than retrospective recall.

What does the research actually find on AI hours saved?

Brynjolfsson, Li, and Raymond 2023 on AI customer service, refined in Stanford Digital Economy Lab 2024-2025 follow-ups, found 35 to 40 percent handle-time reduction for new workers on routine issues, but much smaller or negative effects for experienced workers on complex cases. The mean across suitable tasks is closer to 30 percent. The number is highly task-specific and worker-specific.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
