The owner uploaded the full folder of client contracts to a new AI tool, expecting to ask it questions about renewal dates, indemnity clauses and notice periods across the whole back catalogue. The first dozen questions came back fine. The next dozen returned answers that ignored every contract signed before 2021. He checked the files. They were all there. They all opened cleanly in his PDF reader. He could read them on screen with his own eyes.
The AI could not.
The reason is the kind of distinction owners rarely think about until it bites them. Two PDFs that look identical on screen can be structurally different at the file level, and the difference decides whether an AI tool can read them or has to give up. This piece walks through what that distinction actually is, what kinds of document you most likely hold, and how to decide which ones are worth the pre-processing effort.
What does AI-readable actually mean at the file level?
It means the document contains a real text layer that a machine can extract directly, not just pixels rendered on screen. Native digital PDFs, the kind printed from Word or generated by a template tool, carry embedded text objects, fonts and structural metadata that AI systems can read without rendering the page as an image first. Scanned PDFs do not. They are pictures wrapped in a PDF container.
When you open either one in Adobe Reader or Preview, they look the same because the reader software has already rendered the contents for display. At the data layer underneath, they are different objects. PyMuPDF’s technical write-up on native versus vision-model extraction explains why this matters: the native PDF feeds machine-readable text straight into downstream tools, while the scanned PDF yields nothing until OCR has added a text layer. The owner’s pre-2021 contracts almost certainly fall into the second group, scanned during an earlier digitisation effort and never given a searchable text layer.
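For an owner with a technically minded colleague, the diagnosis takes a few lines of Python. The sketch below uses PyMuPDF to ask whether a file yields any meaningful text at all; the file path is hypothetical and the 25-characters-per-page threshold is an illustrative heuristic rather than a standard, but a file that fails this check is almost certainly a scan.

    # Quick diagnostic: does this PDF carry a real text layer, or is it pictures in a wrapper?
    # Requires PyMuPDF (pip install pymupdf); path and threshold are illustrative only.
    import fitz  # PyMuPDF

    def has_text_layer(path: str, min_chars_per_page: int = 25) -> bool:
        doc = fitz.open(path)
        try:
            extracted = sum(len(page.get_text().strip()) for page in doc)
            # Native digital PDFs yield plenty of text; scanned ones yield almost nothing.
            return extracted >= min_chars_per_page * doc.page_count
        finally:
            doc.close()

    print(has_text_layer("contracts/client-agreement-2019.pdf"))  # hypothetical file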
A smaller third failure mode also exists. Some older PDFs use embedded fonts with custom character encodings that cannot be mapped to standard character sets, so text extraction returns garbage rather than readable words. IBM’s documentation on this for the FileNet platform is the cleanest technical reference. OCR will not help, because OCR is not the problem: the file’s text is technically there but cannot be decoded.
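The same library can flag this third case, though the signal is different: text comes out, but much of it is unreadable. The sketch below is illustrative only, counting what share of the extracted characters are unprintable or the Unicode replacement character, with an arbitrary 20 per cent cut-off.

    # Quick diagnostic for the broken-encoding case: text extracts, but it is not readable.
    # The 20 per cent junk threshold is an arbitrary illustration, not a standard.
    import fitz  # PyMuPDF

    def looks_garbled(path: str, max_junk_ratio: float = 0.2) -> bool:
        doc = fitz.open(path)
        text = "".join(page.get_text() for page in doc)
        doc.close()
        if not text.strip():
            return False  # nothing extracted at all: that is the scanned-PDF case, not this one
        junk = sum(1 for ch in text
                   if ch == "\ufffd" or (not ch.isprintable() and not ch.isspace()))
        return junk / len(text) > max_junk_ratio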
Which document categories does your SME actually hold?
Three, in roughly descending order of volume across a typical owner-operated business. Native digital documents, generated from Word, Google Workspace or a template tool and saved straight to PDF. Scanned or image-only documents, paper that has been digitised at some point and now lives as a picture inside a PDF wrapper. And handwritten or photographed material, including signed contract pages, whiteboard photos from meetings, and forms with handwritten fields.
Native digital documents need nothing. The text is there, the AI can read it, you can move straight to the workflow. Scanned documents need OCR before any AI tool can use them, and ABBYY’s reference on PDF types is the clearest plain-English explanation of why. Handwritten material occupies a middle ground where recovery varies sharply by script quality, ink clarity, page age and layout density. Photographs of whiteboards, counterintuitively, are often more recoverable than handwritten notes because marker contrast is high and the background is uniform.
The mix in your own business depends on history. A firm that has been digitally native for fifteen years holds mostly category one. A firm that ran a paper-to-PDF digitisation project five to ten years ago will hold a meaningful chunk of category two, and the contracts the AI ignored are almost certainly in there. Category three only matters if your workflows touch signed pages, intake forms or meeting captures.
Where will you actually meet this problem?
In the gap between what the AI tool claims and what it does on your actual files. Vendors describe their products as accepting PDFs, which is technically true. What they rarely say is that the tool reads the text layer if there is one and silently ignores the file if there is not. The result is a workflow that looks fine on recent documents and quietly fails on the older half.
The pattern shows up sharpest in three places. Legacy contract review, where pre-digitisation files are scans. Compliance and audit work, where older policy documents and signed approvals were filed as images. And document search across the shared drive, where the search tool returns hits from the last few years but appears to think the older folders are empty. None of these are AI failures in the model sense. They are file-format failures the model never had the chance to address.
When should you OCR a document set, and when should you leave it alone?
When three conditions all hold: volume is high enough that the per-document cost is sensible, the AI use case for the documents is clear, and the cost of an extraction error is one you can plan for. GRM Document Management puts basic scanning at roughly seven to twelve pence per page and OCR at one to five pence on top, so a ten-thousand-page contract archive sits in the eight hundred to seventeen hundred pound range.
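The arithmetic is easy to re-run against your own page count: 10,000 pages at 7p plus 1p per page comes to £800 at the bottom of the range, and 10,000 pages at 12p plus 5p comes to £1,700 at the top.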
Volume matters because the fixed setup cost is real. A few dozen scanned files can be handled with the OCR already built into everyday tools, Word’s PDF conversion or macOS Live Text among them, at no incremental cost, and that is the right approach for small jobs. Use case clarity matters because OCR for search (“find the contract with client X”) tolerates lower accuracy than OCR for extraction (“read the renewal date into our pipeline”). Error cost matters because LlamaIndex’s analysis of OCR accuracy shows that 99 per cent character accuracy can still produce field-level mistakes that need human review.
The honest answer for many SMEs is to OCR one or two high-value archives, often the contract back-catalogue or the client correspondence file, and leave the long tail of miscellaneous material alone until a clear use case surfaces.
What does this connect to in the rest of the cluster?
It connects to the wider question of how AI-ready your document estate actually is. This post sits in the documents and unstructured content section of the broader data and knowledge readiness cluster, alongside related pieces on SOPs AI can actually read, the shared drive problem, and naming conventions for AI retrieval.
The underlying argument across all of them is the same. AI projects fail at the data and document layer far more often than at the model layer, and the fix is proportionate triage rather than a wholesale digitisation programme. The work is to identify which document categories you hold, which of them have a clear AI use case attached, and which ones justify the pre-processing effort. The pre-2021 contracts that triggered this piece are a textbook example. Solvable, but only once the file-level diagnosis is right.
If you are looking at a document estate of your own and are not sure how to size the work, that is the kind of scoping conversation worth having before any OCR invoice gets raised. Book a conversation.