Measuring AI hours saved without lying to yourself

You’re about to include “saves the team ten hours a week” in the board deck. The figure came from a message sent to the team on Friday, nine replies by Monday, numbers averaged. It looks clean on a slide.

Then the CFO asks how you measured it.

That question shouldn’t throw you. But it will, if the only honest answer is “we asked people to estimate.” The team poll has a fundamental problem that comes before any question of recall or honesty: the people who answered had incentives to distort the number, pulling in opposite directions, and almost no way to notice they were doing it.

Why does asking your team for the number backfire?

When you ask someone how much time the AI saves them, you are inviting them to self-report on a technology decision they either backed or were subject to. That creates two opposing pressures that push the answer away from the truth. Underneath both sits a third problem that is less visible but equally damaging: people are genuinely poor at estimating their own time.

The person who championed the tool has an incentive to report substantial savings. Their credibility is attached to the decision. Reporting three hours saved when they expected ten is an uncomfortable conversation, and without meaning to, their estimate drifts upward.

The person who worries about what efficiency gains might mean for headcount does the opposite. Their number drifts down.

Neither is deliberate. Both are ordinary responses to incentive.

And even when incentives are reasonably aligned, time estimation is unreliable. Research on worker productivity consistently finds significant gaps between recalled time and actual time, particularly on cognitively demanding or fragmented tasks. The document review that “only took a moment” included several interruptions. The task that felt like two hours often ran to three.

The retrospective survey has a use: as a rough sanity check after you have done the proper measurement. As the primary basis for a financial claim, it does not hold up.

What does a proper before-and-after time study look like?

A before-and-after time study runs the same protocol on a specific task twice: once before the AI is in use, and once after users have had time to reach reasonable competence. The two measurements give you a comparison you can defend, because they used the same method, the same sample, and the same task definition. Everything else is directional.

Start the baseline phase before the AI tool is deployed. Define the task precisely. Not “document work” but “first-pass relevance review for client files.” Time how long it takes, per person, across a representative sample. Two weeks is the minimum; four weeks captures more of the natural variation in workload.

Sample five to ten people, across a spread of experience levels where possible. Record time to the nearest five or fifteen minutes. High-precision logging to the nearest minute introduces false accuracy: knowledge work involves split attention, context switching, and interruptions, and claiming precision you do not have weakens the methodology.
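As a minimal sketch of that rounding rule, assuming durations are captured as whole minutes: the helper name `round_to_increment` and the sample readings are hypothetical, chosen only for illustration.

```python
def round_to_increment(minutes: int, increment: int = 15) -> int:
    """Round whole minutes to the nearest multiple of `increment`.

    Exact halves round up. Uses integer arithmetic so the result does not
    depend on Python's round-half-to-even behaviour.
    """
    if increment <= 0:
        raise ValueError("increment must be positive")
    return (minutes + increment // 2) // increment * increment


# Hypothetical stopwatch readings for three review sessions, in minutes.
raw_minutes = [52, 68, 97]
logged = [round_to_increment(m) for m in raw_minutes]
print(logged)
```

Pass `increment=5` for the finer grain. Whichever grain you pick, use the same one in both measurement phases.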

Run the same protocol four to six weeks into the AI deployment, once people have moved past the initial learning phase. Same task definition, same sample size, same measurement period.

The difference between your two mean figures is your hours-saved figure. The Standish Group’s CHAOS studies suggest that only around one third of technology projects deliver their expected benefits; the ones that do typically have defined success criteria and measurement protocols in place from the start. The protocol separates a directional hunch from a number with methodology behind it.
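The comparison itself is simple arithmetic. The sketch below assumes the same five people were timed in both phases and that each value is one person's average weekly hours on the defined task. All figures and the function name `mean_hours_saved` are hypothetical.

```python
from statistics import mean

# Hypothetical weekly hours on the defined task, one value per person,
# listed in the same order in both phases.
baseline_hours = [12.0, 10.0, 14.0, 11.0, 13.0]
with_ai_hours = [8.0, 7.5, 9.5, 7.0, 8.0]


def mean_hours_saved(before: list[float], after: list[float]) -> float:
    """Return mean(before) - mean(after) for two same-sized samples."""
    if len(before) != len(after):
        raise ValueError("both phases must cover the same sample")
    return mean(before) - mean(after)


saved = mean_hours_saved(baseline_hours, with_ai_hours)
print(f"Mean hours saved per person per week: {saved:.1f}")
```

Keeping the per-person values, not just the two means, matters later: you need the spread across people to state a margin of error.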

Where does rework time hide in the calculation?

AI output is often faster than human output on the first pass. Where measurement falls apart is the second pass: the checking, the revision, and the downstream rework that follows when the first pass contains errors. If you count only the time saved and ignore the time spent validating, the number in your board deck is partially illusory.

Consider a straightforward version of the problem. The AI reviews documents in forty per cent less time than a human. You record the saving and move on. But the team has developed a habit of checking AI-flagged items more carefully than they would check their own work, because they know the tool occasionally misses nuance on complex cases. That checking time sits outside the official measurement.

The correct calculation subtracts the quality-assurance overhead. If half the AI output is accepted without revision and half requires ten minutes of rework, the average rework cost is five minutes per document, and the net saving is the headline saving per document minus those five minutes.
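One way to work that subtraction through, averaged over all documents. The half-and-half split and the ten minutes of rework come from the example above; the twelve-minute headline saving is an assumed figure for illustration, and the function name is hypothetical.

```python
def net_minutes_saved_per_document(
    headline_saving: float,
    rework_share: float,
    rework_minutes: float,
) -> float:
    """Headline saving minus the average rework cost per document.

    rework_share is the fraction of AI output that needs revision (0 to 1).
    """
    if not 0.0 <= rework_share <= 1.0:
        raise ValueError("rework_share must be between 0 and 1")
    return headline_saving - rework_share * rework_minutes


# Assumed: the AI saves 12 minutes per document on the first pass.
# From the example: half the output needs 10 minutes of rework.
net = net_minutes_saved_per_document(12.0, 0.5, 10.0)
print(f"Net saving: {net:.1f} minutes per document")
```

If the average rework cost exceeds the headline saving, the result goes negative, which is exactly the case the first-pass figure hides.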

The same logic applies to downstream problems. A misclassified document that creates a compliance issue three weeks later costs real hours to resolve. That time belongs in the measure.

Research from Microsoft Research, Stanford HAI, and MIT on AI output quality in professional services contexts finds that users consistently underestimate AI errors, particularly when the output looks plausible. The cost of that underestimation shows up in rework.

When should you use an activity log instead of a time study?

A full time study requires someone to coordinate sampling, standardise the task definition, and run the protocol twice. For smaller firms, or for a low-stakes pilot where you want a rough-and-ready directional figure, that overhead may not be proportionate. The activity log is a practical middle ground: users record their own time in near-real-time against a structured template.

Research on worker productivity measurement suggests activity logs give a substantial proportion of a time study’s accuracy at a fraction of the cost. The trade-off is compliance: logs are filled in reliably for the first two or three weeks, then less so. They work best over a bounded two-week window, not as an ongoing practice.

Structure the log around the specific tasks you want to measure, with a clear field for whether the AI was involved. That distinction must be present from day one; adding it retrospectively recreates the recall problem you were trying to avoid.
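A structured template can be as plain as a spreadsheet with fixed columns. The sketch below shows one possible shape, written out as CSV. The field names and sample rows are hypothetical; the only requirement taken from the text is that AI involvement is its own field from the first entry.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class LogEntry:
    day: date
    person: str
    task: str          # use the agreed task definition, word for word
    minutes: int       # rounded, not stopwatch-precise
    ai_involved: bool  # recorded at the time, never back-filled


def write_log(entries, handle) -> None:
    """Write entries as CSV with one column per LogEntry field."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(LogEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))


# Hypothetical entries; in practice each person adds rows as they work.
entries = [
    LogEntry(date(2024, 3, 4), "A", "first-pass relevance review", 45, True),
    LogEntry(date(2024, 3, 4), "B", "first-pass relevance review", 75, False),
]
buffer = io.StringIO()
write_log(entries, buffer)
print(buffer.getvalue())
```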

For a firm with under twenty professionals, a bounded two-week activity log completed daily is usually the most practical path to a credible figure. For firms with clear task definitions and enough staff to support sampling, a proper time study produces a number with more authority.

How do you present an honest number without undermining it?

A narrow time study with a small sample produces a number that carries a genuine margin of error. The instinct is to round that margin away so the figure sounds cleaner in a presentation. Do the opposite. State the error bar, and the CFO’s scepticism will work in your favour, because methodical measurement with stated limitations is harder to dismiss than a suspiciously clean figure.

With five to ten professionals and a two-week time study, the confidence interval on the mean is wide, often roughly plus or minus twenty to thirty per cent, depending on how much individuals vary. That means “we saved 4.2 hours per person per week” should be reported as “between 3.2 and 5.2 hours, based on a two-week time study of seven professionals, with rework included.”
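For a sample this small, a rough sketch of the range calculation is a 95% confidence interval on the mean using Student's t. The seven per-person savings below are hypothetical, picked so that the mean is 4.2 hours. The approach assumes the savings are roughly normally distributed across people, which a sample of seven cannot confirm.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom
# (sample size minus one). Covers samples of five to ten people.
T_95 = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_with_interval(savings: list[float]) -> tuple[float, float, float]:
    """Return (low, mean, high) for a 95% confidence interval on the mean."""
    n = len(savings)
    if n - 1 not in T_95:
        raise ValueError("this sketch only covers samples of 5 to 10 people")
    centre = mean(savings)
    half_width = T_95[n - 1] * stdev(savings) / sqrt(n)
    return centre - half_width, centre, centre + half_width


# Hypothetical net hours saved per week, one value per professional.
weekly_hours_saved = [3.0, 5.5, 4.0, 2.5, 5.0, 4.5, 4.9]
low, centre, high = mean_with_interval(weekly_hours_saved)
print(f"{centre:.1f} hours (95% interval {low:.1f} to {high:.1f})")
```

With these invented figures the interval comes out close to the 3.2 to 5.2 range quoted above. A team whose individual savings vary more will get a wider one.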

That sentence earns more trust, not less.

A clean figure invites deconstruction (“how do you know it’s exactly 4.2?”). A stated methodology with a range signals you understood the limitations and measured anyway. Boards are more likely to accept a well-evidenced range than to probe a suspiciously precise point estimate.

The hours-saved figure is the foundation. Once you have one that holds up, the next question a board will ask is what those hours are being redeployed into. Hours saved in a professional services firm do not automatically show up as margin; that conversion requires a separate decision and a separate plan. But the plan starts from having a credible hours-saved figure in the first place.

