What is computer vision? Why it matters for your business

A production manager checking quality data on a tablet beside a working conveyor line in a small UK precision-engineering factory
TL;DR

Computer vision is the branch of AI that interprets images and video. In 2026 it covers five core tasks (classification, object detection, OCR, segmentation, biometric recognition) at price points from £5 a month on a cloud API to £15,000 for a working no-code system. The make-or-break question is not which vendor or which model; it is how many images the system was trained on under your real operating conditions.

Key takeaways

- Computer vision is software that decides what is in an image, where it is, what it says, or whether it matches a target. It is already inside accounting OCR, shelf cameras, and claim photo apps.
- The five core tasks are classification, object detection, OCR and document AI, segmentation, and facial or biometric recognition. Pick the task before you pick the vendor.
- In 2025, 42% of UK manufacturers scrapped AI projects, with vision quality control among the most common casualties. The dominant failure mode was training on clean lab images and shipping into a real production line.
- Multimodal LLMs handle low-volume, varied visual tasks without training. Specialist models on edge hardware win on high-volume, fixed tasks. Knowing which side your problem sits on is the procurement decision.
- Facial recognition is the only computer vision category with genuinely tight UK GDPR constraints. Quality control, OCR, shelf monitoring, and claim photos sit under standard data protection rules.

The managing director of a 60-staff precision-engineering firm in the West Midlands had two quotes on her desk and a problem she wanted gone. Roughly 3% of finished parts were shipping with surface defects that customers later rejected. Vendor A, a London consultancy, wanted £85,000 for a custom defect-detection model trained on 200 internally captured images. Vendor B quoted £4,500 for a model trained on 800 images captured across three shifts, including the morning oil-spray cycle, running on a £499 device at the line itself.

Both vendors said the same five words on the cover slide: AI-powered computer vision quality inspection. She was trying to work out which question mattered most: the model, the price, or the data the system had been trained on.

What is computer vision?

Computer vision is the branch of AI that interprets images and video and acts on what it sees. Software breaks each picture into millions of pixels, runs them through a neural network (usually a convolutional neural network), and outputs a decision: a label, a location, extracted text, a pixel-level boundary, or an identity match. It is one of the major sibling branches of machine learning alongside language and structured data work.

The phrase covers five distinct tasks that are not interchangeable. Image classification gives one label per image: pass or fail on a weld, normal or abnormal on an X-ray. Object detection adds location: how many of which SKUs are on which shelf, where on the part the defect sits. OCR and document AI extract structured fields from invoices, contracts, and property documents at 95 to 99% field-level accuracy in 2026, against 85 to 90% from traditional OCR three years earlier. Image segmentation draws pixel-level boundaries, used in medical imaging and precise defect outlining. Facial and biometric recognition identifies or authenticates individuals.

The point worth holding on to is that computer vision is already inside tools many UK firms buy. Accounting OCR that auto-fills invoice fields. Retail shelf cameras. Insurance claim photo apps. Security footage analysis. Often the vendor does not call it computer vision, because by 2026 the phrase has stopped doing marketing work.

Why it matters for your business

It matters because the cost-to-capability ladder has compressed dramatically in eighteen months, and the question is no longer whether you can afford computer vision. The question is whether the cheapest version that fits your problem is good enough. There are now four working tiers below the bespoke build that used to dominate the market.

Cloud platform APIs sit at the bottom. Google Cloud Vision charges $1.50 to $2.25 per 1,000 units with the first 1,000 free; Azure Computer Vision S1 starts at $375 per 500,000 transactions; AWS Rekognition’s content moderation runs at $0.10 per minute. A UK conveyancing firm scanning 50 invoices a month can run on the Google free tier indefinitely.
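To make the rate card concrete, here is a minimal sketch of the arithmetic, using the $1.50 per 1,000 units rate and 1,000-unit free tier quoted above. The function name and figures are illustrative; check the live pricing pages before budgeting.

```python
def google_vision_monthly_cost(units: int,
                               free_units: int = 1_000,
                               usd_per_1000: float = 1.50) -> float:
    """Estimate monthly Google Cloud Vision spend in USD.

    Illustrative only: assumes the lower quoted rate and the
    1,000-unit monthly free tier; real bills vary by feature.
    """
    billable = max(0, units - free_units)
    return billable / 1_000 * usd_per_1000

# The conveyancing firm scanning 50 invoices a month stays inside
# the free tier, so its API bill is zero.
print(google_vision_monthly_cost(50))      # 0.0
print(google_vision_monthly_cost(10_000))  # 13.5
```

At 10,000 units a month the bill is still under $14, which is why the cloud tier is the default starting point for document workloads.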

No-code platforms sit one rung up: Roboflow, Landing AI, Averroes, Nanonets, and Microsoft's Lobe. Total project cost typically runs £2,000 to £15,000 for a working system including hardware. Above that sit open-source libraries (OpenCV, YOLO; zero licence cost, £10,000 to £50,000 in implementation labour) and edge devices (Luxonis OAK-D at around £200, NVIDIA Jetson Orin Nano at around £499) that avoid recurring cloud fees on 24/7 video streams. The decision is not "which is best", it is "which one fits the volume, the privacy, and the in-house engineering you actually have".
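That closing trade-off can be sketched as a rough decision helper. The thresholds and tier labels below are my illustrative assumptions, not vendor guidance, but they capture the shape of the choice:

```python
def suggest_tier(images_per_month: int,
                 data_must_stay_onsite: bool,
                 has_ml_engineers: bool) -> str:
    """Rough first-pass tier suggestion based on volume, privacy,
    and in-house engineering. Thresholds are illustrative only."""
    if data_must_stay_onsite:
        # Keeping frames off the cloud pushes you to edge hardware.
        return ("open-source on edge hardware" if has_ml_engineers
                else "no-code platform deployed to an edge device")
    if images_per_month <= 1_000:
        # Inside the typical cloud free tier.
        return "cloud API free tier"
    if images_per_month > 100_000:
        # Recurring per-unit fees start to dwarf a one-off device.
        return ("open-source on edge hardware" if has_ml_engineers
                else "no-code platform deployed to an edge device")
    return "cloud API (paid tier)"

print(suggest_tier(50, False, False))     # cloud API free tier
print(suggest_tier(500_000, True, True))  # open-source on edge hardware
```

The point of the sketch is that two of the three inputs have nothing to do with the model: privacy posture and engineering capacity decide the tier as often as volume does.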

Where you will meet it

You will meet it in the back office before you meet it on the shop floor. Modern accounts payable software has been quietly running computer vision on every invoice you upload for at least three years. Rossum, Klippa, Affinda and Hyperscience are purpose-built for it; Tipalti, Xero and the larger ERP suites embed it. The vendor calls it “smart capture” or “AI invoice processing” and the underlying technology is OCR plus document AI.

You will meet it in retail, where Focal Systems’ Shelf AI platform analyses five million retail images a day, giving client retailers hourly visibility into what is on the shelves versus what should be there. You will meet it in insurance, where GEICO’s Easy Photo Estimate lets policyholders photograph damage and uses computer vision to assist repair estimates before a human adjuster signs them off. You will meet it in manufacturing, where modern vision systems detect defects as small as 0.1 millimetres at 99.8% accuracy and make pass-or-fail decisions in under 50 milliseconds.

You will increasingly meet it inside multimodal LLMs. GPT-5.4, Claude Opus 4.7, Gemini 3 Pro and Qwen3.5 can analyse photographs, answer questions about them, and extract information without any custom training. For a one-off photo or a varied document check, a multimodal LLM is often the fastest path. For a fixed task running thousands of times a day, a specialist model on edge hardware wins on accuracy, speed, and cost per transaction.
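The cost-per-transaction crossover can be estimated with one line of arithmetic. The £499 device cost and $1.50 per 1,000 units rate come from the tiers above; the 24-month amortisation period and the exchange rate are assumptions I have added for illustration:

```python
def breakeven_units_per_month(device_cost_gbp: float = 499.0,
                              amortise_months: int = 24,
                              cloud_usd_per_1000: float = 1.50,
                              usd_per_gbp: float = 1.25) -> float:
    """Monthly image volume above which a one-off edge device is
    cheaper than per-unit cloud pricing. The amortisation period
    and FX rate are assumed, not quoted figures."""
    monthly_device_usd = device_cost_gbp * usd_per_gbp / amortise_months
    return monthly_device_usd / (cloud_usd_per_1000 / 1_000)

# Roughly 17,000 images a month under these assumptions: far below
# what a 24/7 production line generates, which is why edge wins there.
print(round(breakeven_units_per_month()))  # 17326
```

Even ignoring latency and accuracy, a fixed high-volume task crosses that line within days, while an ad-hoc document check never gets near it.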

When to ask about it, when to ignore it

Ask about it when the visual task is repetitive, high-volume, and performed under conditions humans find tiring or unsafe: production-line inspection, shelf scanning across many stores, batch document processing, claim photo triage. These are the workloads where humans miss things, vary by shift, and where the economic case for vision is usually clear. A growing share of UK firms now run computer vision in one of these patterns.

Ignore it when the task is one-off, varied, and humans are already accurate. A 12-person legal practice reviewing five contracts a day does not need to deploy computer vision; the partners read faster than the model can be trained. Ignore it also when the volume is too low to justify the infrastructure: a vendor pitching a £15,000 deployment for a process that runs eight times a month is selling you the technology, not solving your problem.

The trap to watch for is vendor demos shot in lab conditions. The single useful procurement question is "how many images did you train on, captured in our actual operating environment, across how many shifts and product variants?" If the answer is under 500, or all from one shift in one season, the model will likely fail when it meets reality. Inadequate training data is the dominant reason 42% of UK manufacturing AI projects were scrapped in 2025, and vision quality control was among the most common casualties.

The vocabulary around it

Multimodal AI is the umbrella term for systems that handle vision, text, and sometimes audio together. The 2026 multimodal LLMs are increasingly used for the visual tasks that used to require a dedicated computer vision model. The two fields are converging.

Foundation models include vision-specific ones such as CLIP and DINO, which other vendors fine-tune for particular tasks. Transfer learning, where a model pre-trained on millions of public images is then fine-tuned on a few hundred of yours, is what makes 50-image deployments practical.

Explainable AI is the discipline that lets you audit why a vision model rejected a particular part or flagged a particular face. The EU AI Act classifies remote biometric identification and emotion recognition as high-risk; the MHRA opened a call for evidence on AI in healthcare with a 2 February 2026 deadline. For typical SME use cases (manufacturing QC, shelf monitoring, document OCR, claim photos) the regulatory line is standard GDPR. For anything involving identifying individuals from biometric features, it is special category data under UK GDPR with a default presumption of prohibition.

The point of all this vocabulary is practical. When the next vendor walks in with a quote, ask about the training data before the model, and the price after both. That is what separates a £500 working pilot from a six-figure write-off. If you would like to talk it through, book a conversation.

Sources

- Ultralytics (2026). Computer vision in 2026: models, tools and use cases. Field overview and 2026 trend reference. https://www.ultralytics.com/blog/everything-you-need-to-know-about-computer-vision-in-2025
- TDWI (2025). What is computer vision? A beginner's guide to how AI sees the world. Plain-English explainer of the underlying neural network approach. https://tdwi.org/blogs/ai-101/2025/09/what-is-computer-vision.aspx
- Google Cloud (2026). OCR with Google AI. Reference for OCR and document AI capability and 200-language support. https://cloud.google.com/use-cases/ocr
- Google Cloud (2026). Cloud Vision API pricing. Per-1,000-unit pricing for label, text, and object detection. https://cloud.google.com/vision/pricing
- Amazon Web Services (2026). Amazon Rekognition pricing. Cloud vision API rate card including content moderation per-minute pricing. https://aws.amazon.com/rekognition/pricing/
- Microsoft Azure (2026). Azure Computer Vision pricing. Reference for S1 tier transactional pricing. https://azure.microsoft.com/en-us/pricing/details/computer-vision/
- Primotly (2026). Most cost-effective computer vision tools for SMEs in 2026. Reference for the no-code platform tier and edge hardware price points. https://primotly.com/article/the-most-cost-effective-computer-vision-tools-for-small-businesses-an-expert-guide
- CompareTheCloud. Five things UK manufacturers got wrong when they first tried AI on the production line. Source for the 42% scrapped-projects figure and training-data failure mode. https://www.comparethecloud.net/articles/five-things-uk-manufacturers-got-wrong-ai-production-line
- GDPR Advisor. GDPR and facial recognition: privacy implications and legal considerations. Reference for biometric data as special category under UK GDPR. https://www.gdpr-advisor.com/gdpr-and-facial-recognition-privacy-implications-and-legal-considerations/
- EU AI Act (2024). Annex III: high-risk AI systems. Source for high-risk classification of remote biometric identification and emotion recognition. https://artificialintelligenceact.eu/annex/3/
- MHRA (2025). MHRA seeks input on AI regulation at pivotal moment for healthcare. Source for the 2 February 2026 call for evidence deadline. https://www.gov.uk/government/news/mhra-seeks-input-on-ai-regulation-at-pivotal-moment-for-healthcare

Frequently asked questions

How many images do I actually need to train a usable computer vision model?

Far fewer than the marketing implies. Thanks to transfer learning, where the model starts from one already trained on millions of public images, 50 to 200 well-chosen photos from your actual operating environment can deliver production accuracy. The catch is that those images have to cover every lighting condition, contamination state, product variant, and shift the system will see. Two hundred clean lab photos will fail; eighty real shop-floor photos across three shifts will often work.

When should I use a multimodal LLM versus a specialist computer vision model?

Use a multimodal LLM (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro) when the visual task is varied and low-volume. Examples: ad-hoc claim photos, one-off document checks, "what is wrong with this packaging?". Use a specialist model on edge hardware when the task is fixed and high-volume. Examples: every part on a production line, every shelf in 200 stores. The specialist wins on accuracy, speed, and cost per transaction once volume is repetitive.

Does facial recognition stop me using computer vision in my business?

No, and conflating the two costs UK firms viable projects every year. Facial recognition uses biometric data, which UK GDPR classifies as special category data with a presumption of prohibition. That is genuinely tight. Quality inspection, shelf monitoring, OCR on invoices, and damage assessment from claim photos do not involve biometric identification at all. They sit under standard GDPR: lawful basis, transparency, data minimisation. Typical SME use cases are in the second group.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
