A practical way to measure AI productivity gains

TL;DR

The practical way to measure AI productivity gains in a UK services firm is to pick one workflow, set a baseline before rollout, track two to four task-level metrics, and re-measure consistently over 90 days. Evidence from the UK AI Security Institute shows gains appear on quality, speed, or both depending on the task, so measuring only one dimension gives an incomplete picture. Data protection and security governance belong in the measurement design from the start.

Key takeaways

- Pick one workflow and set a two-to-four-week baseline before introducing any AI tool. Without a pre-AI comparison, you cannot separate the tool's effect from normal variation in how your team works.
- Track two to four task-level metrics: time per task, first-pass quality or error rate, throughput per person per week, and client turnaround time. Quality and speed gains often appear on different tasks and need to be measured separately.
- Monitor adoption alongside output. A tool used inconsistently by half the team will suppress measured productivity gains even where individual users are faster.
- Re-measure at 30, 60, and 90 days. Initial gains can reflect novelty rather than genuine improvement, and sustained measurement is the only way to distinguish the two.
- Check data protection and security governance before scaling. UK GDPR, a potential Data Protection Impact Assessment, and NCSC security controls belong in the pilot design from the start, not as an afterthought.

A marketing agency owner I spoke with recently had been running AI writing tools across her small team for about three months. She was confident something was working. Her team seemed to spend less time on first drafts. Client feedback had improved slightly. But when I asked what had actually changed, she paused. She hadn’t measured it before, during, or after. She had impressions, but no data.

That gap is expensive. Without a baseline, you cannot tell where the AI is actually helping, where it is not, and whether any change will hold beyond the first few weeks of novelty.

This post sets out a practical measurement approach for owner-operated UK services firms with five to fifty staff. The framework draws on task-level evidence from the UK AI Security Institute’s 2025 pilot study and productivity research from McKinsey UK, not on vendor claims.

What does measuring AI productivity gains actually mean?

Measuring AI productivity gains means comparing a small set of before/after task metrics on one workflow, not calculating “AI ROI” across the whole business on day one. The UK AI Security Institute’s 2025 pilot study used a randomised controlled trial to isolate the effect of a large language model against a control group, tracking quality, time taken, and points per minute.

That task-level discipline is the model worth copying. The AISI results show why the task granularity matters: AI improved quality by 22% on a monitoring task and 23% on a technical drafting task, but did not improve speed on either. On a separate interpreting-information task, it improved speed by 42% and points per minute by 102% but showed no quality gain. The lesson is that “AI productivity” is not one thing. Quality and speed improvements show up on different tasks, and measuring only one misses half the picture.

This matters for a small services firm because different workflows have different bottlenecks. Proposal drafting, call summarisation, case triage, and client reporting all behave differently under AI assistance. You need to pick one, measure it properly, and resist the urge to track everything at once.

Why does it matter for your business?

Without a baseline, you cannot separate the effect of the AI tool from normal variation in how your team works week to week. This is a key reason productivity claims often collapse on inspection. McKinsey UK’s 2025 analysis describes a “new productivity paradox”: nearly 60% of UK survey respondents see AI as a productivity opportunity, but in the Snowflake/YouGov survey of UK enterprises only 23% had seen gains reach scale.

The baseline is what separates a real gain from a coincidence. Spend two to four weeks before rollout measuring cycle time, rework rate, error rate, and output quality on the chosen workflow. Keep everyone on the old process during that period. Then introduce the AI tool to one team and compare like-for-like. This approach mirrors the controlled design used by AISI, and it is not complex to set up. What it requires is patience. Many owners skip the baseline because they are keen to see results quickly, and that impatience is what makes the eventual case for expanding the rollout so hard to make.
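The like-for-like comparison can be sketched in a few lines of Python. This is a minimal illustration, not the AISI method: it assumes you have logged minutes per task for the baseline period and the pilot period, the figures are invented, and the "one standard deviation" rule is a rough assumption rather than a formal statistical test.

```python
from statistics import mean, stdev

def compare_to_baseline(baseline, pilot):
    """Compare pilot task times against the pre-AI baseline.

    Returns the percentage change in mean time per task and a flag for
    whether the change exceeds the baseline's own task-to-task spread.
    """
    base_mean = mean(baseline)
    pilot_mean = mean(pilot)
    change_pct = (pilot_mean - base_mean) / base_mean * 100
    # Rough rule of thumb (an assumption, not a formal test): treat a shift
    # smaller than one baseline standard deviation as normal variation.
    exceeds_noise = abs(pilot_mean - base_mean) > stdev(baseline)
    return round(change_pct, 1), exceeds_noise

# Hypothetical minutes per proposal draft, one value per task.
baseline_minutes = [62, 58, 65, 60, 55, 63, 59, 61]
pilot_minutes = [48, 45, 52, 47, 50, 44, 49, 46]

change, beyond_noise = compare_to_baseline(baseline_minutes, pilot_minutes)
print(f"Change in mean time per task: {change}%")
print(f"Larger than normal variation: {beyond_noise}")
```

The point of the second value is that a change only counts once it is clearly bigger than the variation you already had before the tool arrived. With a handful of tasks, treat the result as a prompt for further measurement, not as proof.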

McKinsey also identifies low adoption as a key suppressor of gains. An AI tool that only half the team uses consistently will show a much weaker improvement in aggregate, and sometimes none you can detect, even where the active users are genuinely faster. Track usage alongside output from the start, or the measurement will mislead you.
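The arithmetic behind that dilution is simple enough to sketch. The example below assumes, purely for illustration, that active users save 20% of their task time, that non-users stay at the baseline pace, and that everyone handles a similar share of the workload. None of these figures come from the McKinsey analysis.

```python
def aggregate_time_saving(adoption_rate, saving_for_active_users):
    """Team-wide share of task time saved when only some staff use the tool.

    Assumes non-users work at the baseline pace and everyone handles a
    similar share of the workload (both simplifying assumptions).
    """
    return adoption_rate * saving_for_active_users

# Hypothetical: active users are 20% faster, at three adoption levels.
for adoption in (1.0, 0.5, 0.25):
    saving = aggregate_time_saving(adoption, 0.20)
    print(f"adoption {adoption:.0%}: team-wide time saving {saving:.0%}")
```

A team-wide saving in the single digits is easy to lose inside ordinary week-to-week variation, which is why usage needs tracking next to output.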

Where will you actually see the gains?

The gains show up at task level before they show up anywhere else. Anthropic’s analysis of 100,000 real-world Claude conversations estimated that AI reduced task completion time by 80% in those interactions. McKinsey cites controlled studies showing software developers completed 26% more tasks with AI assistance, call-centre agents resolved 14% more issues per hour, and mid-level professionals spent around 40% less time on routine writing and analysis.

For a UK services firm, four metrics cover the ground that matters: time per task, first-pass quality or error rate, throughput per person per week, and client turnaround time. You do not need all four for a pilot. Two is enough. Choose the pair that maps to the workflow’s actual bottleneck. If the bottleneck is quality, track quality and rework rate. If it is speed, track time per task and throughput.
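As a rough sketch of how three of these metrics could be derived from a simple task log, the example below uses a hypothetical `TaskRecord` structure with invented entries. The field names and the definition of "first pass" (the task cleared its first review without rework) are assumptions to adapt to your own workflow.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    person: str
    week: int
    minutes: float
    passed_first_review: bool

def summarise(records):
    """Return mean time per task, first-pass rate, and tasks per person per week."""
    n = len(records)
    time_per_task = sum(r.minutes for r in records) / n
    first_pass_rate = sum(r.passed_first_review for r in records) / n
    # Counts only person-weeks in which at least one task was logged.
    person_weeks = {(r.person, r.week) for r in records}
    throughput = n / len(person_weeks)
    return {
        "time_per_task_min": round(time_per_task, 1),
        "first_pass_rate": round(first_pass_rate, 2),
        "tasks_per_person_week": round(throughput, 1),
    }

# Hypothetical log: two people, two weeks.
log = [
    TaskRecord("A", 1, 50, True),
    TaskRecord("A", 1, 40, False),
    TaskRecord("B", 1, 45, True),
    TaskRecord("A", 2, 35, True),
    TaskRecord("B", 2, 40, True),
    TaskRecord("B", 2, 30, False),
]
metrics = summarise(log)
print(metrics)
```

Running the same function on the baseline log and the pilot log keeps the definitions identical across both periods, which matters more than the sophistication of the metrics themselves.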

After 30 days, re-measure. After 60 days, re-measure again. The Ada Lovelace Institute’s policy briefing on measuring AI makes clear that one-off gains can reflect novelty rather than genuine, lasting improvement. Regular remeasurement is what distinguishes a real change from a short-term bump.
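One way to make the checkpoints comparable is to express each as a gain against the same baseline and watch whether it holds. The sketch below uses invented figures, and the 50% cut-off for flagging a fading gain is an arbitrary illustrative threshold, not something taken from the Ada Lovelace Institute briefing.

```python
def gain_vs_baseline(baseline_minutes, checkpoint_minutes):
    """Percentage time saved per task at each checkpoint, relative to baseline."""
    return {
        day: round((baseline_minutes - minutes) / baseline_minutes * 100, 1)
        for day, minutes in checkpoint_minutes.items()
    }

def looks_like_novelty(gains, retained_share=0.5):
    """Flag a gain that has faded to less than `retained_share` of its day-30 level.

    The default 50% cut-off is an arbitrary illustrative threshold.
    """
    return gains[90] < gains[30] * retained_share

# Hypothetical mean minutes per task: baseline 60, then each checkpoint.
gains = gain_vs_baseline(60, {30: 45, 60: 51, 90: 54})
print(gains)
print("Novelty effect suspected:", looks_like_novelty(gains))
```

In this invented case the day-30 gain has more than halved by day 90, so the early result would have overstated what the tool delivers once it becomes routine.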

Convert any findings into money only after the operational picture is stable. For a services firm, the financial upside usually comes from fewer hours spent on low-value tasks, faster turnaround, or increased capacity, not from headcount reduction.
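Once the operational figures are stable, the conversion itself is straightforward arithmetic. The sketch below is illustrative only: the minutes saved, task volume, hourly cost, and 46 working weeks are all hypothetical inputs, and it assumes the freed time is actually redeployed to useful work.

```python
def annual_value_of_time_saved(minutes_saved_per_task, tasks_per_week,
                               hourly_cost, working_weeks=46):
    """Rough annual value of time freed up, in the same currency as hourly_cost.

    Assumes the freed time is actually redeployed to useful work, which is
    an assumption to verify rather than a given.
    """
    hours_per_week = minutes_saved_per_task * tasks_per_week / 60
    return round(hours_per_week * working_weeks * hourly_cost, 2)

# Hypothetical: 12 minutes saved per task, 25 tasks a week, £40 per hour.
value = annual_value_of_time_saved(12, 25, 40)
print(f"Estimated annual value of time saved: £{value:,.0f}")
```

Set the result against the tool's licence and set-up costs before calling it a return, and revisit it if the 90-day measurement shows the time saving shrinking.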

When should you start measuring, and when is it too early?

Start measuring when you have three things in place: a repeatable workflow where time and quality are observable, a small team willing to run the experiment consistently, and two to four weeks of baseline data collected before the AI tool goes live. Without all three conditions, any comparison you make will be inconclusive. Many pilots fail to show a clear result because the baseline was never established.

It is too early to measure when the workflow itself is still changing. If your team is adapting their processes at the same time as they are using AI, you are measuring two things at once and cannot separate them. Similarly, if the main bottleneck is an upstream approval chain or client sign-off cycle rather than the task itself, AI is unlikely to move the main constraint. Measuring effort saved on the task will look useful in isolation but will mislead you about what the workflow can actually produce.

A workflow that is already highly standardised may show that AI shifts work rather than removes it. The gains are real but smaller, and often hard to distinguish from the gains you would have seen from process improvement alone. Snowflake’s UK research found that companies struggling to scale AI productivity gains often push adoption before the evidence base from early pilots is solid.

What governance checks belong alongside the measurement?

Data protection and security sit alongside the measurement process, not after it. If the workflow you are testing involves personal data belonging to clients or employees, UK GDPR and the Data Protection Act 2018 apply. The ICO’s guidance is clear: AI does not exempt an organisation from data protection duties, and where there is high risk, a Data Protection Impact Assessment is expected before the AI use begins.

The NCSC advises treating AI security as part of normal cyber hygiene. In practice, this means classifying what data staff are allowed to share with third-party AI systems, setting access controls, and testing for prompt injection and data leakage. These controls are worth putting in place before the pilot, not after. An AI tool that accelerates work while exposing client data is a governance failure. For regulated UK firms, the FCA expects AI use to sit within existing obligations on governance, outsourcing, operational resilience, and fair treatment of customers, so check with your compliance function before expanding any pilot that touches client-facing outputs. If you sell into the EU or use AI systems that fall within the EU AI Act’s scope, high-risk system requirements on risk management and human oversight will also shape how you configure the measurement.

Before scaling any pilot, publish a clear internal list of what staff should not use AI for: final legal or regulatory submissions without human review, outputs involving special-category personal data where governance is not yet in place, and any task where a hallucination or confidentiality failure would carry serious commercial or regulatory cost.

The measurement framework itself is not complicated. Pick one workflow. Set a baseline. Choose two to four metrics. Run a small pilot with one team. Track usage alongside output. Re-measure at 30, 60, and 90 days. That sequence works whether you have five staff or fifty.

What makes it hard is the patience it demands. The owners who get reliable data are the ones who resist the urge to introduce the AI tool before the baseline is in place. If you can hold that discipline for four weeks before rollout, the rest of the measurement process follows naturally.

Sources

- UK AI Security Institute (2025). "AI and the Future of Work: measuring AI-driven productivity gains for workplace tasks." RCT-designed pilot study tracking quality, time, and points per minute across task types. The primary source for task-level measurement methodology referenced in this post. https://www.aisi.gov.uk/blog/ai-and-the-future-of-work-measuring-ai-driven-productivity-gains-for-workplace-tasks
- McKinsey UK (2025). "The new productivity paradox." Synthesises controlled study evidence on developer, call-centre, and professional productivity gains and identifies low adoption as a key barrier to scaling. https://www.mckinsey.com/uk/our-insights/uk-blog/the-new-productivity-paradox
- Ada Lovelace Institute (2023). "Measuring up: responsible evaluation of AI." Policy briefing on the measurement challenges and risks of evaluating AI productivity claims, including the novelty-effect problem. https://www.adalovelaceinstitute.org/policy-briefing/measuring-up/
- Snowflake / YouGov (2025). UK enterprise survey of 500 senior decision-makers. Found 45% reported early or use-case-specific productivity gains, but only 23% had achieved gains at scale. Reported via ERP.today. https://erp.today/snowflake-research-finds-uk-firms-investing-in-ai-but-struggling-to-scale-productivity-gains/
- Anthropic (2025). "Estimating Productivity Gains from Using Claude." Analysis of 100,000 real-world conversations estimating 80% reduction in task completion time in those interactions. https://www.anthropic.com/research/estimating-productivity-gains
- Information Commissioner's Office (2025). Artificial intelligence guidance under UK GDPR. Sets out data protection obligations for organisations using AI, including lawful basis, transparency, and data minimisation. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/
- Information Commissioner's Office (2025). Data Protection Impact Assessments. Sets out when a DPIA is required, including for high-risk AI processing involving personal data. https://ico.org.uk/for-organisations/data-protection-impact-assessments/
- National Cyber Security Centre (2025). Artificial intelligence guidance collection. Covers data classification, access control, prompt injection, and data leakage risks for organisations deploying AI tools. https://www.ncsc.gov.uk/collection/artificial-intelligence
- Financial Conduct Authority (2025). AI guidance for regulated firms. Sets out how AI use must fit within obligations on governance, outsourcing, operational resilience, and fair treatment of customers. https://www.fca.org.uk/firms/ai
- EU Artificial Intelligence Act (2024). Regulation (EU) 2024/1689. Sets requirements for high-risk AI systems on risk management, data governance, logging, transparency, and human oversight. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

Frequently asked questions

How long does it take to measure AI productivity gains properly?

Plan for at least two to four weeks of baseline measurement before you introduce the AI tool, then four to six weeks of post-rollout tracking before drawing any first conclusions. Re-measure at 30, 60, and 90 days after rollout to confirm those conclusions hold. Evidence from the Ada Lovelace Institute highlights that one-off measurements often capture novelty effects rather than genuine, sustained productivity improvement.

What metrics should I track to measure AI productivity in a small services firm?

Start with two to four metrics: time per task, first-pass quality or error rate, throughput per person per week, and client turnaround time. Two well-measured metrics will give you more reliable data than six poorly tracked ones. The UK AI Security Institute's 2025 pilot tracked quality, time, and task efficiency separately because gains commonly appear on only one dimension rather than all at once.

Do data protection rules apply when I test AI tools with client data?

Yes. UK GDPR and the Data Protection Act 2018 apply to any AI use involving personal data, including pilot tests. You need a lawful basis for the processing, and the ICO expects a Data Protection Impact Assessment where processing is likely to result in high risk to individuals. The NCSC also advises firms to control what data staff share with third-party AI systems before any pilot begins.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

