Measuring hours saved without lying to yourself

TL;DR

Asking your team how much time the AI saves them is the most widespread hours-saved measurement method and the most unreliable one, because the people answering have incentives to over- or under-report and poor recall regardless. A before-and-after time study on one specific task, with rework time counted in and an honest error bar stated, produces a figure that survives scrutiny from a sceptical CFO. The methodology earns the trust; the clean number rarely does.

Key takeaways

- Retrospective team surveys distort hours-saved figures in both directions: the person who championed the tool overstates, and the person who fears redundancy understates, and both misremember regardless.
- A before-and-after time study on a single, precisely defined task, run over at least two weeks per measurement point with five to ten people in the sample, produces a figure you can defend because it has a methodology behind it.
- Rework and quality-checking time must be subtracted from the headline figure; if AI output requires correction or downstream validation at a higher rate than human output would, the net saving is smaller than the first-pass comparison suggests.
- The activity log is a practical middle ground for smaller firms or low-stakes pilots, giving a substantial proportion of a full time study's accuracy at a fraction of the coordination overhead, provided users complete it in near-real-time.
- Reporting an honest error bar alongside the hours-saved figure increases credibility with sceptical boards, because stated limitations signal methodical measurement rather than a figure that was polished until it looked clean.

You’re about to include “saves the team ten hours a week” in the board deck. The figure came from a message sent to the team on Friday, nine replies by Monday, numbers averaged. It looks clean on a slide.

Then the CFO asks how you measured it.

That question shouldn’t throw you. But it will, if the only honest answer is “we asked people to estimate.” The team poll has a fundamental problem that sits before any question about recall or honesty: the people who answered had opposing incentives to distort the number, and almost no way to notice they were doing it.

Why does asking your team for the number backfire?

When you ask someone how much time the AI saves them, you are inviting them to self-report on a technology decision they either backed or were subject to. That creates two opposing pressures that push the answer away from the truth. Underneath both sits a third problem that is less visible but equally damaging: people are genuinely poor at estimating their own time.

The person who championed the tool has an incentive to report substantial savings. Their credibility is attached to the decision. Reporting three hours saved when they expected ten is an uncomfortable conversation, and without meaning to, their estimate drifts upward.

The person who worries about what efficiency gains might mean for headcount does the opposite. Their number drifts down.

Neither is deliberate. Both are ordinary responses to incentive.

And even when incentives are reasonably aligned, time estimation is unreliable. Research on worker productivity consistently finds significant gaps between recalled time and actual time, particularly on cognitively demanding or fragmented tasks. The document review that “only took a moment” included several interruptions. The task that felt like two hours often ran to three.

The retrospective survey has a use: as a rough sanity check after you have done the proper measurement. As the primary basis for a financial claim, it does not hold up.

What does a proper before-and-after time study look like?

A before-and-after time study runs the same protocol on a specific task twice: once before the AI is in use, and once after users have had time to reach reasonable competence. The two measurements give you a comparison you can defend, because they used the same method, the same sample, and the same task definition. Everything else is directional.

Start the baseline phase before the AI tool is deployed. Define the task precisely. Not “document work” but “first-pass relevance review for client files.” Time how long it takes, per person, across a representative sample. Two weeks is the minimum; four weeks captures more of the natural variation in workload.

Sample five to ten people, across a spread of experience levels where possible. Record time to the nearest five or fifteen minutes. High-precision logging to the nearest minute introduces false accuracy: knowledge work involves split attention, context switching, and interruptions, and claiming precision you do not have weakens the methodology.

Run the same protocol four to six weeks into the AI deployment, once people have moved past the initial learning phase. Same task definition, same sample size, same measurement period.

The difference between your two mean figures is your headline hours-saved figure, before any rework is subtracted. Research from the Standish Group’s CHAOS studies suggests that roughly one third of technology projects deliver expected benefits; the ones that do typically have defined success criteria and measurement protocols in place from the start. The protocol separates a directional hunch from a number with methodology behind it.
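As a rough illustration of the comparison, the sketch below takes the difference between the two means. The function name and every number in it are hypothetical, made up for the example: seven people, with weekly hours on one defined task recorded to the nearest half hour.

```python
from statistics import mean


def headline_hours_saved(baseline_hours, post_hours):
    """Difference between mean time on the task before and after the AI tool.

    Both lists hold one figure per person, gathered with the same task
    definition and the same measurement period.
    """
    return mean(baseline_hours) - mean(post_hours)


# Hypothetical weekly hours per person on the defined task (illustrative only).
baseline = [10.0, 12.5, 9.0, 11.0, 13.0, 10.5, 11.0]  # before deployment
with_ai = [6.5, 8.0, 6.0, 7.5, 8.5, 6.0, 6.5]         # four to six weeks in

saved = headline_hours_saved(baseline, with_ai)
print(f"Headline saving: {saved:.1f} hours per person per week")
```

This is only the first-pass comparison. It says nothing yet about checking or rework time.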

Where does rework time hide in the calculation?

AI output is often faster than human output on the first pass. The place measurement falls apart is the second pass: the checking, the revision, and the downstream rework that follows when the first pass contains errors. If you count only the time saved and ignore the time spent validating, the number in your board deck is partially illusory.

Consider a straightforward version of the problem. The AI reviews documents in forty per cent less time than a human. You record the saving and move on. But the team has developed a habit of checking AI-flagged items more carefully than they would check their own work, because they know the tool occasionally misses nuance on complex cases. That checking time sits outside the official measurement.

The correct calculation subtracts the quality-assurance overhead. If half the AI output is accepted without revision and half requires ten minutes of rework, the average rework cost is five minutes per document, and the net saving is the headline saving per document minus those five minutes.

The same logic applies to downstream problems. A misclassified document that creates a compliance issue three weeks later costs real hours to resolve. That time belongs in the measure.
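A minimal sketch of that subtraction, assuming rework is averaged across all documents (rework rate multiplied by rework minutes) and that any downstream cost has already been spread per document. The function, parameter names and figures are hypothetical, chosen only to mirror the example above.

```python
def net_minutes_saved(human_minutes, ai_minutes, rework_rate,
                      rework_minutes, downstream_minutes_per_doc=0.0):
    """Net saving per document once checking and rework are counted.

    rework_rate is the share of AI outputs that need correction (0 to 1).
    downstream_minutes_per_doc is an assumed average of later clean-up
    time, spread across every document processed.
    """
    headline = human_minutes - ai_minutes
    expected_rework = rework_rate * rework_minutes
    return headline - expected_rework - downstream_minutes_per_doc


# Hypothetical figures: a 30-minute review done 40 per cent faster (18 minutes),
# with half of the outputs needing 10 minutes of rework.
net = net_minutes_saved(human_minutes=30, ai_minutes=18,
                        rework_rate=0.5, rework_minutes=10)
print(f"Headline saving: 12 minutes; net saving: {net:.0f} minutes per document")
```

In this made-up case the headline saving of twelve minutes shrinks to seven once rework is included, before any downstream cost is added.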

Research from Microsoft Research, Stanford HAI, and MIT on AI output quality in professional services contexts finds that users consistently underestimate AI errors, particularly when the output looks plausible. The cost of that underestimation shows up in rework.

When should you use an activity log instead of a time study?

A full time study requires someone to coordinate sampling, standardise the task definition, and run the protocol twice. For smaller firms, or for a low-stakes pilot where you want a rough-and-ready directional figure, that overhead may not be proportionate. The activity log is a practical middle ground: users record their own time in near-real-time against a structured template.

Research on worker productivity measurement suggests activity logs give a substantial proportion of a time study’s accuracy at a fraction of the cost. The trade-off is compliance: logs are filled in reliably for the first two or three weeks, then less so. They work best over a bounded two-week window, not as an ongoing practice.

Structure the log around the specific tasks you want to measure, with a clear field for whether the AI was involved. That distinction must be present from day one; adding it retrospectively recreates the recall problem you were trying to avoid.
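One way such a template might look is sketched below. The field names, the task label and the entries are all assumptions for illustration; the point is only that the AI-involved flag and rework time are captured on every row from the first day.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class LogEntry:
    day: date
    person: str
    task: str              # one of the precisely defined tasks being measured
    minutes: int           # first-pass time on the task
    ai_involved: bool      # recorded at the time, not reconstructed later
    rework_minutes: int = 0


def write_log(entries, stream):
    """Write the log as CSV with one row per entry."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(LogEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))


def mean_minutes(entries, task, ai_involved):
    """Mean total time (first pass plus rework) for one task, with or without AI."""
    times = [e.minutes + e.rework_minutes for e in entries
             if e.task == task and e.ai_involved == ai_involved]
    return sum(times) / len(times) if times else None


TASK = "first-pass relevance review"  # hypothetical task label
entries = [
    LogEntry(date(2025, 1, 6), "A", TASK, 45, ai_involved=False),
    LogEntry(date(2025, 1, 6), "B", TASK, 30, ai_involved=True, rework_minutes=10),
    LogEntry(date(2025, 1, 7), "A", TASK, 25, ai_involved=True, rework_minutes=5),
]

buffer = io.StringIO()
write_log(entries, buffer)
print(buffer.getvalue())
print("Mean with AI:", mean_minutes(entries, TASK, ai_involved=True))
print("Mean without AI:", mean_minutes(entries, TASK, ai_involved=False))
```

A shared spreadsheet with the same columns would serve equally well; the structure matters more than the tooling.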

For a firm with under twenty professionals, a two-week activity log completed daily is usually the most practical path to a credible figure. For firms with clear task definitions and enough staff to support sampling, a proper time study produces a number with more authority.

How do you present an honest number without undermining it?

A narrow time study with a small sample produces a number that carries a genuine margin of error. The instinct is to round that margin away so the figure sounds cleaner in a presentation. Do the opposite. State the error bar, and the CFO’s scepticism will work in your favour, because methodical measurement with stated limitations is harder to dismiss than a suspiciously clean figure.

With five to ten professionals and a two-week time study, the confidence interval on the mean is roughly plus or minus twenty to thirty per cent. That means “we saved 4.2 hours per person per week” should be reported as “between 3.2 and 5.2 hours, based on a two-week time study of seven professionals, with rework included.”
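For readers who want to see where a range like that can come from, here is a rough sketch of a 95 per cent confidence interval on the mean, using the standard t-distribution critical values for small samples. The seven per-person savings are invented, and chosen so that the result lands close to the example above; real data will give a wider or narrower range depending on how much individuals vary.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom
# (sample size minus one), covering samples of five to ten people.
T_CRIT_95 = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def confidence_interval(savings):
    """Return (low, mean, high) for a 95% interval on the mean saving."""
    n = len(savings)
    centre = mean(savings)
    half_width = T_CRIT_95[n - 1] * stdev(savings) / sqrt(n)
    return centre - half_width, centre, centre + half_width


# Hypothetical net hours saved per person per week, rework already subtracted.
net_savings = [2.5, 3.5, 4.0, 4.0, 4.5, 5.0, 5.9]

low, centre, high = confidence_interval(net_savings)
print(f"Mean {centre:.1f} hours, 95% interval {low:.1f} to {high:.1f} hours")
```

The interval captures variation between the people sampled. It does not capture problems with the task definition or the logging itself, so the methodology still needs to be stated alongside it.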

That sentence earns more trust, not less.

A clean figure invites deconstruction (“how do you know it’s exactly 4.2?”). A stated methodology with a range signals you understood the limitations and measured anyway. Boards are more likely to accept a well-evidenced range than a suspiciously precise point estimate.

The hours-saved figure is the foundation. Once you have one that holds up, the next question a board will ask is what those hours are being redeployed into. Hours saved in a professional services firm do not automatically show up as margin; that conversion requires a separate decision and a separate plan. But the plan starts from having a credible hours-saved figure in the first place.

Sources

- Brynjolfsson, E., Li, D. & Raymond, L.R. (2023). Generative AI at Work. NBER Working Paper 31161. Documents 35-40% handle-time reduction via AI assistance for new workers on routine tasks; illustrates the task-specific and worker-specific variation that makes aggregate hours-saved claims unreliable. https://www.nber.org/papers/w31161
- Brynjolfsson, E. (2024). Stanford Digital Economy Lab AI productivity research. Shows right-skewed distribution of AI productivity gains with wide variance across workers and tasks; supports narrowing measurement to specific tasks rather than averaging across a function. https://digitaleconomy.stanford.edu/research/
- McKinsey Global Institute (2023). The economic potential of generative AI. Provides the value-leakage framework showing how productivity gains are absorbed into work expansion or client pricing rather than bottom-line margin, relevant to the hours-to-profitability conversion problem. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai
- Standish Group (2024). CHAOS Report. Documents that approximately one third of technology projects deliver expected benefits; projects with defined success criteria and measurement protocols from the outset are substantially more likely to reach them. https://standishgroup.myshopify.com/
- Gartner (2024). Augmented-Decisioning framework for AI measurement. Distinguishes automated, augmented, and informational AI use cases; each requires a different measurement approach, supporting the case for task-specific time-study design. https://www.gartner.com/en/information-technology/topics/artificial-intelligence
- ICAEW (2024). Technology and AI resources for accountancy practices. Provides adoption and measurement context for UK professional services firms; directly relevant to the sector-specific dynamics of hours-saved measurement in fee-earning environments. https://www.icaew.com/technical/technology/artificial-intelligence
- Solicitors Regulation Authority (2024). AI guidance for solicitors. Sets out quality-control requirements for AI use in legal practice; directly relevant to why quality-checking and rework time must feature in hours-saved calculations for professional services firms. https://www.sra.org.uk/solicitors/resources/future-of-legal-services/ai/
- GitHub (2023). Quantifying GitHub Copilot's impact on developer productivity and happiness. Controlled assessment showing AI code output varies on multiple quality dimensions; faster on standard tasks, with edge-case defects that do not appear in first-pass speed measurements. https://github.blog/2023-10-10-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
- Stanford HAI and MIT CSAIL (2024). Research on AI output quality in professional services contexts. Finds systematic overestimation of AI output quality by users who see plausible output and assume accuracy; errors emerge at scale in downstream processes, not at the point of generation. https://hai.stanford.edu/research

Frequently asked questions

How do I know if our hours-saved figure is reliable enough to take to the board?

If the figure came from a team survey, treat it as directional at best. A defensible figure requires a before-and-after time study on a specific task, run over at least two weeks per measurement point, with rework time included. With five to ten people in the study, expect an error bar of roughly plus or minus twenty to thirty per cent on the mean. That range is honest enough to put in front of a CFO.

What is the minimum measurement required to produce a credible hours-saved figure?

Define one task precisely, time it across five to ten people for two weeks before the AI is deployed, then run the same protocol four to six weeks after deployment. Record rework separately and subtract it from the headline saving. The difference in mean time, with the methodology stated and the error bar disclosed, is a figure you can defend.

Does output quality need to be measured separately from time saved?

It cannot safely be separated. If the AI produces work faster but at a higher error rate, the rework time belongs in the calculation. Downstream problems, such as a compliance issue created by a misclassified output, also carry real cost. Time saved and quality impact are two dimensions of the same number, not two separate questions.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.
