A marketing agency owner I spoke with recently had been running AI writing tools across her small team for about three months. She was confident something was working. Her team seemed to spend less time on first drafts. Client feedback had improved slightly. But when I asked what had actually changed, she paused. She hadn’t measured it before, during, or after. She had impressions, but no data.
That gap is expensive. Without a baseline, you cannot tell where the AI is actually helping, where it is not, or whether any change will hold beyond the first few weeks of novelty.
This post sets out a practical measurement approach for owner-operated UK services firms with five to fifty staff. The framework draws on task-level evidence from the UK AI Security Institute’s 2025 pilot study and productivity research from McKinsey UK, not on vendor claims.
What does measuring AI productivity gains actually mean?
Measuring AI productivity gains means comparing a small set of before/after task metrics on one workflow, not calculating “AI ROI” across the whole business on day one. The UK AI Security Institute’s 2025 pilot study used a randomised controlled trial to isolate the effect of a large language model against a control group, tracking quality, time taken, and points per minute.
That task-level discipline is the model worth copying. The AISI results show why the task granularity matters: AI improved quality by 22% on a monitoring task and 23% on a technical drafting task, but did not improve speed on either. On a separate interpreting-information task, it improved speed by 42% and points per minute by 102% but showed no quality gain. The lesson is that “AI productivity” is not one thing. Quality and speed improvements show up on different tasks, and measuring only one misses half the picture.
This matters for a small services firm because different workflows have different bottlenecks. Proposal drafting, call summarisation, case triage, and client reporting all behave differently under AI assistance. You need to pick one, measure it properly, and resist the urge to track everything at once.
Why does it matter for your business?
Without a baseline, you cannot separate the effect of the AI tool from normal variation in how your team works week to week. This is a key reason productivity claims often collapse on inspection. McKinsey UK’s 2025 analysis describes a “new productivity paradox”: nearly 60% of UK survey respondents see AI as a productivity opportunity, but only 23% of firms in a UK enterprise study had seen gains reach scale.
The baseline is what separates a real gain from a coincidence. Spend two to four weeks before rollout measuring cycle time, rework rate, error rate, and output quality on the chosen workflow. Keep everyone on the old process during that period. Then introduce the AI tool to one team and compare like-for-like. This approach mirrors the controlled design used by AISI, and it is not complex to set up. What it requires is patience. Many owners skip the baseline because they are keen to see results quickly, and that impatience is what makes the eventual case for expanding the rollout so hard to make.
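The like-for-like comparison can be sketched in a few lines of Python. This is a minimal illustration, not a statistical test: the function name, the sample figures, and the "twice the baseline spread" rule of thumb are all assumptions made for the example, and a real pilot would want more weeks of data.

```python
from statistics import mean, stdev

def compare_to_baseline(baseline_weeks, pilot_weeks):
    """Compare weekly average cycle times (minutes per task) before and after rollout."""
    base_mean = mean(baseline_weeks)
    pilot_mean = mean(pilot_weeks)
    # Week-to-week spread during the baseline: the normal variation the pilot has to beat.
    base_spread = stdev(baseline_weeks)
    change = pilot_mean - base_mean
    return {
        "baseline_mean": base_mean,
        "pilot_mean": pilot_mean,
        "percent_change": 100 * change / base_mean,
        # Assumed rule of thumb, not a formal test: treat the change as noise
        # unless it is larger than twice the baseline spread.
        "exceeds_normal_variation": abs(change) > 2 * base_spread,
    }

# Hypothetical figures: four baseline weeks, then four pilot weeks.
baseline = [50.0, 54.0, 46.0, 50.0]
pilot = [40.0, 38.0, 42.0, 40.0]
result = compare_to_baseline(baseline, pilot)
print(result)
```

The point of the spread check is that a 5% change means little if the team's cycle time already swings by 10% from one week to the next.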
McKinsey also identifies low adoption as a key suppressor of gains. An AI tool that only half the team uses consistently may produce little measurable improvement in aggregate, even where the active users are genuinely faster. Track usage alongside output from the start, or the measurement will mislead you.
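The dilution effect is simple arithmetic. The sketch below assumes that non-users save no time at all and that work is split evenly across the team; both are simplifications, and the figures are hypothetical.

```python
def team_level_time_saving(adoption_rate, active_user_saving):
    """Average time saving across the whole team.

    Assumes non-users save nothing and work is split evenly between staff.
    """
    return adoption_rate * active_user_saving

# Hypothetical case: half the team uses the tool, and those users are 20% faster.
team_saving = team_level_time_saving(adoption_rate=0.5, active_user_saving=0.20)
print(f"{team_saving:.0%}")  # prints 10%
```

A 10% team-level change can easily sit inside normal week-to-week variation, which is why usage data needs to be read next to the output metrics.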
Where will you actually see the gains?
The gains show up at task level before they show up anywhere else. Anthropic’s analysis of 100,000 real-world Claude conversations estimated that AI reduced task completion time by about 80% in those interactions. McKinsey cites controlled studies showing software developers completed 26% more tasks with AI assistance, call-centre agents resolved 14% more issues per hour, and mid-level professionals spent around 40% less time on routine writing and analysis.
For a UK services firm, four metrics cover the ground that matters: time per task, first-pass quality or error rate, throughput per person per week, and client turnaround time. You do not need all four for a pilot; two are enough. Choose the pair that maps to the workflow’s actual bottleneck. If the bottleneck is quality, track quality and rework rate. If it is speed, track time per task and throughput.
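As a rough sketch, most of these metrics can be computed from a simple task log kept in a spreadsheet or time-tracking tool. The record fields, the function name, and the sample entries below are hypothetical; adapt them to whatever your team already records.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    person: str          # who completed the task
    minutes: float       # time spent on the task
    needed_rework: bool  # True if the first pass had to be redone

def summarise(records, weeks):
    """Compute time per task, rework rate, and throughput from a task log."""
    people = {r.person for r in records}
    return {
        "avg_minutes_per_task": sum(r.minutes for r in records) / len(records),
        "rework_rate": sum(r.needed_rework for r in records) / len(records),
        "tasks_per_person_per_week": len(records) / (len(people) * weeks),
    }

# Hypothetical one-week log for a two-person team.
log = [
    TaskRecord("A", 30, False),
    TaskRecord("A", 50, True),
    TaskRecord("B", 40, False),
    TaskRecord("B", 40, False),
]
summary = summarise(log, weeks=1)
print(summary)
```

Running the same function over the baseline log and the pilot log gives a like-for-like pair of numbers for whichever two metrics you have chosen.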
After 30 days, re-measure. After 60 days, re-measure again, and once more at 90 days. The Ada Lovelace Institute’s policy briefing on measuring AI makes clear that one-off gains can reflect novelty rather than genuine, lasting improvement. Regular remeasurement is what distinguishes a real change from a short-term bump.
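One simple way to check for a novelty bump is to ask how much of the early gain is still there at the next checkpoint. The sketch below does this for a lower-is-better metric such as minutes per task; the function names and figures are illustrative assumptions, not part of any cited study.

```python
def gain_vs_baseline(baseline, measured):
    """Fractional improvement for a lower-is-better metric such as minutes per task."""
    return (baseline - measured) / baseline

def retained_share(baseline, day_30, day_60):
    """Share of the 30-day gain that is still present at 60 days.

    Returns None when there was no gain at 30 days to retain.
    """
    early = gain_vs_baseline(baseline, day_30)
    later = gain_vs_baseline(baseline, day_60)
    if early <= 0:
        return None
    return later / early

# Hypothetical figures: 50 minutes at baseline, 40 at day 30, 45 at day 60.
share = retained_share(baseline=50.0, day_30=40.0, day_60=45.0)
print(share)
```

In this made-up case only half of the early gain survives to the second checkpoint, which is the pattern that suggests novelty rather than a lasting change.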
Convert any findings into money only after the operational picture is stable. For a services firm, the financial upside usually comes from fewer hours spent on low-value tasks, faster turnaround, or increased capacity, not from headcount reduction.
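Once the operational numbers have settled, the conversion to money is straightforward arithmetic. The sketch below is an illustration only: the function name, the hourly cost, the task volume, and the default of 46 working weeks are all assumptions to replace with your own figures.

```python
def annual_value_of_time_saved(minutes_saved_per_task, tasks_per_week,
                               hourly_cost_gbp, working_weeks=46):
    """Estimate the yearly value of time saved on one workflow, in pounds.

    working_weeks=46 is an assumed default, not a standard figure.
    """
    hours_per_week = minutes_saved_per_task * tasks_per_week / 60
    return hours_per_week * hourly_cost_gbp * working_weeks

# Hypothetical case: 10 minutes saved on each of 30 tasks a week, at £40 per hour.
value = annual_value_of_time_saved(10, 30, 40.0)
print(f"£{value:,.0f}")
```

The figure only becomes real money if the freed hours are redeployed to billable or higher-value work, which is why it belongs at the end of the measurement process rather than the start.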
When should you start measuring, and when is it too early?
Start measuring when you have three things in place: a repeatable workflow where time and quality are observable, a small team willing to run the experiment consistently, and two to four weeks of baseline data collected before the AI tool goes live. Without all three conditions, any comparison you make will be inconclusive. Many pilots fail to show a clear result because the baseline was never established.
It is too early to measure when the workflow itself is still changing. If your team is adapting their processes at the same time as they are using AI, you are measuring two things at once and cannot separate them. Similarly, if the main bottleneck is an upstream approval chain or client sign-off cycle rather than the task itself, AI is unlikely to move the main constraint. Measuring effort saved on the task will look useful in isolation but will mislead you about what the workflow can actually produce.
A workflow that is already highly standardised may show that AI shifts work rather than removes it. The gains are real but smaller, and often hard to distinguish from the gains you would have seen from process improvement alone. Snowflake’s UK research found that companies struggling to scale AI productivity gains often push adoption before the evidence base from early pilots is solid.
What governance checks belong alongside the measurement?
Data protection and security sit alongside the measurement process, not after it. If the workflow you are testing involves personal data belonging to clients or employees, UK GDPR and the Data Protection Act 2018 apply. The ICO’s guidance is clear: AI does not exempt an organisation from data protection duties, and where there is high risk, a Data Protection Impact Assessment is expected before the AI use begins.
The NCSC advises treating AI security as part of normal cyber hygiene. In practice, this means classifying what data staff are allowed to share with third-party AI systems, setting access controls, and testing for prompt injection and data leakage. These controls are worth putting in place before the pilot, not after. An AI tool that accelerates work while exposing client data is a governance failure. For regulated UK firms, the FCA expects AI use to sit within existing obligations on governance, outsourcing, operational resilience, and fair treatment of customers, so check with your compliance function before expanding any pilot that touches client-facing outputs. If you sell into the EU or use AI systems that fall within the EU AI Act’s scope, high-risk system requirements on risk management and human oversight will also shape how you configure the measurement.
Before scaling any pilot, publish a clear internal list of what staff should not use AI for: final legal or regulatory submissions without human review, outputs involving special-category personal data where governance is not yet in place, and any task where a hallucination or confidentiality failure would carry serious commercial or regulatory cost.
The measurement framework itself is not complicated. Pick one workflow. Set a baseline. Choose two to four metrics. Run a small pilot with one team. Track usage alongside output. Re-measure at 30, 60, and 90 days. That sequence works whether you have five staff or fifty.
What makes it hard is the patience it demands. The owners who get reliable data are the ones who resist the urge to introduce the AI tool before the baseline is in place. If you can hold that discipline for four weeks before rollout, the rest of the measurement process follows naturally.



