The managing director of a 60-staff precision-engineering firm in the West Midlands had two quotes on her desk and a problem she wanted gone. Roughly 3% of finished parts were shipping with surface defects that customers later rejected. Vendor A, a London consultancy, wanted £85,000 for a custom defect-detection model trained on 200 internally captured images. Vendor B quoted £4,500 for a model trained on 800 images captured across three shifts, including the morning oil-spray cycle, running on a £499 device at the line itself.
Both vendors put the same five words on the cover slide: “AI-powered computer vision quality inspection”. She was trying to work out what mattered most: the model, the price, or the data the system had been trained on.
What is computer vision?
Computer vision is the branch of AI that interprets images and video and acts on what it sees. Software treats each picture as a grid of pixels, runs it through a neural network (usually a convolutional neural network), and outputs a decision: a label, a location, extracted text, a pixel-level boundary, or an identity match. Alongside language and structured-data work, it is one of the major applied branches of machine learning.
The phrase covers five distinct tasks that are not interchangeable. Image classification gives one label per image: pass or fail on a weld, normal or abnormal on an X-ray. Object detection adds location: how many of which SKUs are on which shelf, where on the part the defect sits. OCR and document AI extract structured fields from invoices, contracts, and property documents at 95 to 99% field-level accuracy in 2026, against 85 to 90% from traditional OCR three years earlier. Image segmentation draws pixel-level boundaries, used in medical imaging and precise defect outlining. Facial and biometric recognition identifies or authenticates individuals.
The point worth holding on to is that computer vision is already embedded in tools many UK firms buy. Accounting OCR that auto-fills invoice fields. Retail shelf cameras. Insurance claim photo apps. Security footage analysis. Often the vendor does not call it computer vision, because by 2026 the phrase has stopped doing marketing work.
Why it matters for your business
It matters because the cost-to-capability ladder has compressed dramatically in eighteen months, and the question is no longer whether you can afford computer vision. The question is whether the cheapest version that fits your problem is good enough. There are now four working tiers below the bespoke build that used to dominate the market.
Cloud platform APIs sit at the bottom. Google Cloud Vision charges $1.50 to $2.25 per 1,000 units with the first 1,000 free; Azure Computer Vision S1 starts at $375 per 500,000 transactions; AWS Rekognition’s content moderation runs at $0.10 per minute. A UK conveyancing firm scanning 50 invoices a month can run on the Google free tier indefinitely.
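Those per-unit prices are easier to reason about as arithmetic. A minimal sketch, using the figures quoted above; the function names are illustrative, Azure's S1 tier is pro-rated here purely for comparison, and vendor rate cards change, so check current pricing before budgeting:

```python
def google_vision_cost(units_per_month, price_per_1000=1.50):
    """First 1,000 units each month are free; $1.50-$2.25 per 1,000 after."""
    billable = max(0, units_per_month - 1000)
    return billable / 1000 * price_per_1000

def azure_vision_cost(transactions_per_month, price_per_500k=375):
    """S1 tier quoted at $375 per 500,000 transactions (pro-rated here)."""
    return transactions_per_month / 500_000 * price_per_500k

# The conveyancing firm: 50 invoices a month stays inside Google's free tier.
print(google_vision_cost(50))                  # 0.0
# A retailer scanning 100,000 shelf images a month:
print(round(google_vision_cost(100_000), 2))   # 148.5
```

The shape of the curve matters more than the exact figures: at document volumes the cloud tier is effectively free, and costs only become material at continuous-imaging volumes.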
No-code platforms sit one rung up: Roboflow, Landing AI, Averroes, Nanonets, Microsoft Lobe. Total project cost typically runs £2,000 to £15,000 for a working system including hardware. Above that sit open-source libraries (OpenCV, YOLO; zero licence cost, £10,000 to £50,000 in implementation labour) and edge devices (Luxonis OAK-D at around £200, NVIDIA Jetson Orin Nano at around £499) that avoid recurring cloud fees on 24/7 video streams. The decision is not “which is best”; it is “which one fits the volume, the privacy constraints, and the in-house engineering you actually have”.
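The edge-versus-cloud part of that decision is break-even arithmetic. A hypothetical sketch, using the £499 Jetson figure against Google's $1.50 per 1,000 units; the exchange rate is an assumption for illustration, and power, maintenance, and engineering time are deliberately ignored:

```python
def breakeven_images(device_cost_gbp, cloud_price_per_1000_usd, fx_gbp_to_usd=1.25):
    """Number of images after which a one-off edge device purchase
    undercuts per-image cloud fees, on these simplified assumptions."""
    device_cost_usd = device_cost_gbp * fx_gbp_to_usd
    return device_cost_usd / cloud_price_per_1000_usd * 1000

# A £499 Jetson Orin Nano against $1.50 per 1,000 images:
print(round(breakeven_images(499, 1.50)))  # 415833
```

A 24/7 camera sampling one frame a second produces 86,400 frames a day, so on these assumptions the device pays for itself in under a week; a 50-invoice-a-month workload never reaches break-even.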
Where you will meet it
You will meet it in the back office before you meet it on the shop floor. Modern accounts payable software has been quietly running computer vision on every invoice you upload for at least three years. Rossum, Klippa, Affinda and Hyperscience are purpose-built for it; Tipalti, Xero and the larger ERP suites embed it. The vendor calls it “smart capture” or “AI invoice processing” and the underlying technology is OCR plus document AI.
You will meet it in retail, where Focal Systems’ Shelf AI platform analyses five million retail images a day, giving client retailers hourly visibility into what is on the shelves versus what should be there. You will meet it in insurance, where GEICO’s Easy Photo Estimate lets policyholders photograph damage and uses computer vision to assist repair estimates before a human adjuster signs them off. You will meet it in manufacturing, where modern vision systems detect defects as small as 0.1 millimetres at 99.8% accuracy and make pass-or-fail decisions in under 50 milliseconds.
You will increasingly meet it inside multimodal LLMs. GPT-5.4, Claude Opus 4.7, Gemini 3 Pro and Qwen3.5 can analyse photographs, answer questions about them, and extract information without any custom training. For a one-off photo or a varied document check, a multimodal LLM is often the fastest path. For a fixed task running thousands of times a day, a specialist model on edge hardware wins on accuracy, speed, and cost per transaction.
When to ask about it, when to ignore it
Ask about it when the visual task is repetitive, high-volume, and performed under conditions humans find tiring or unsafe: production-line inspection, shelf scanning across many stores, batch document processing, claim photo triage. These are the workloads where humans miss things, vary by shift, and where the economic case for vision is usually clear. A growing share of UK firms now run computer vision in one of these patterns.
Ignore it when the task is one-off, varied, and humans are already accurate. A 12-person legal practice reviewing five contracts a day does not need to deploy computer vision; the partners read faster than the model can be trained. Ignore it also when the volume is too low to justify the infrastructure: a vendor pitching a £15,000 deployment for a process that runs eight times a month is selling you the technology, not solving your problem.
The trap to watch for is vendor demos shot in lab conditions. The single useful procurement question is “how many images did you train on, captured in our actual operating environment, across how many shifts and product variants?” If the answer is under 500 or all from one shift in one season, the model will likely fail when it meets reality. Inadequate training data is the dominant reason 42% of UK manufacturing AI projects were scrapped in 2025, and vision quality control was among the common casualties.
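That procurement question reduces to a checklist. A sketch of one, with the 500-image and single-shift red lines taken from the paragraph above; the single-variant check and the exact thresholds are illustrative assumptions, not a standard:

```python
def training_data_risk(images, shifts, variants):
    """Return the red flags in a vendor's training-data answer.
    Thresholds are illustrative, not an industry standard."""
    issues = []
    if images < 500:
        issues.append("fewer than 500 images")
    if shifts < 2:
        issues.append("single shift only")
    if variants < 2:
        issues.append("single product variant")
    return issues

print(training_data_risk(200, 1, 1))  # Vendor A's dataset fails all three checks
print(training_data_risk(800, 3, 2))  # a dataset like Vendor B's raises none
```

An empty list is not a guarantee the model will work; a non-empty one is a strong signal it will not.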
Related concepts
Multimodal AI is the umbrella for systems that handle vision, text, and sometimes audio together. The 2026 multimodal LLMs are increasingly used for the visual tasks that used to require a dedicated computer vision model. The two fields are converging.
Foundation models include vision-specific ones such as CLIP and DINO, which other vendors fine-tune for specific tasks. Transfer learning, where a model pre-trained on millions of public images is then fine-tuned on a few hundred of yours, sometimes as few as 50, is what makes small-dataset deployments practical.
Explainable AI is the discipline that lets you audit why a vision model rejected a particular part or flagged a particular face. The EU AI Act classifies remote biometric identification and emotion recognition as high-risk; the MHRA opened a call for evidence on AI in healthcare with a 2 February 2026 deadline. For typical SME use cases (manufacturing QC, shelf monitoring, document OCR, claim photos) the regulatory line is standard GDPR. For anything involving identifying individuals from biometric features, it is special category data under UK GDPR with a default presumption of prohibition.
The point of all this vocabulary is practical. When the next vendor walks in with a quote, ask about the training data before the model, and the price after both. That is what separates a £500 working pilot from a six-figure write-off. If you would like to talk it through, book a conversation.