The owner uploaded the full folder of client contracts to a new AI tool, expecting to ask it questions about renewal dates, indemnity clauses and notice periods across the whole back catalogue. The first dozen questions came back fine. The next dozen returned answers that ignored every contract signed before 2021. He checked the files. They were all there. They all opened cleanly in his PDF reader. He could read them on screen with his own eyes.
The AI could not.
The reason is the kind of distinction owners rarely think about until it bites them. Two PDFs that look identical on screen can be structurally different at the file level, and the difference decides whether an AI tool can read them or has to give up. This piece walks through what that distinction actually is, what kinds of document you most likely hold, and how to decide which ones are worth the pre-processing effort.
What does AI-readable actually mean at the file level?
It means the document contains a real text layer that a machine can extract directly, not just pixels rendered on screen. Native digital PDFs, the kind printed from Word or generated by a template tool, carry embedded text objects, fonts and structural metadata that AI systems can read without rendering the page as an image first. Scanned PDFs do not. They are pictures wrapped in a PDF container.
When you open either one in Adobe Reader or Preview, they look the same because the reader software has already rendered the contents for display. At the data layer underneath, they are different objects. PyMuPDF’s technical write-up on native versus vision-model extraction explains why this matters: the native PDF feeds machine-readable text straight into downstream tools, while the scanned PDF yields nothing until OCR has added a text layer. The owner’s pre-2021 contracts almost certainly fall into the second group, scanned during an earlier digitisation effort and never given a searchable text layer.
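For an owner with a technically minded colleague, the diagnosis takes a few lines of Python. The sketch below uses PyMuPDF to ask whether a file yields any meaningful text at all; the file path is hypothetical and the 25-characters-per-page threshold is an illustrative heuristic rather than a standard, but a file that fails this check is almost certainly a scan.

    # Quick diagnostic: does this PDF carry a real text layer, or is it pictures in a wrapper?
    # Requires PyMuPDF (pip install pymupdf); path and threshold are illustrative only.
    import fitz  # PyMuPDF

    def has_text_layer(path: str, min_chars_per_page: int = 25) -> bool:
        doc = fitz.open(path)
        try:
            extracted = sum(len(page.get_text().strip()) for page in doc)
            # Native digital PDFs yield plenty of text; scanned ones yield almost nothing.
            return extracted >= min_chars_per_page * doc.page_count
        finally:
            doc.close()

    print(has_text_layer("contracts/client-agreement-2019.pdf"))  # hypothetical file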
A smaller third failure mode also exists. Some older PDFs use embedded fonts with custom character encodings that cannot be mapped to standard character sets, so text extraction returns garbage rather than readable words. IBM’s documentation on this for the FileNet platform is the cleanest technical reference. OCR will not help, because OCR is not the problem: the file’s text is technically there but cannot be decoded.
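The same library can flag this third case, though the signal is different: text comes out, but much of it is unreadable. The sketch below is illustrative only, counting what share of the extracted characters are unprintable or the Unicode replacement character, with an arbitrary 20 per cent cut-off.

    # Quick diagnostic for the broken-encoding case: text extracts, but it is not readable.
    # The 20 per cent junk threshold is an arbitrary illustration, not a standard.
    import fitz  # PyMuPDF

    def looks_garbled(path: str, max_junk_ratio: float = 0.2) -> bool:
        doc = fitz.open(path)
        text = "".join(page.get_text() for page in doc)
        doc.close()
        if not text.strip():
            return False  # nothing extracted at all: that is the scanned-PDF case, not this one
        junk = sum(1 for ch in text
                   if ch == "\ufffd" or (not ch.isprintable() and not ch.isspace()))
        return junk / len(text) > max_junk_ratio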
Which document categories does your SME actually hold?
Three, in roughly descending order of volume across a typical owner-operated business. Native digital documents, generated from Word, Google Workspace or a template tool and saved straight to PDF. Scanned or image-only documents, paper that has been digitised at some point and now lives as a picture inside a PDF wrapper. And handwritten or photographed material, including signed contract pages, whiteboard photos from meetings, and forms with handwritten fields.
Native digital documents need nothing. The text is there, the AI can read it, you can move straight to the workflow. Scanned documents need OCR before any AI tool can use them, and ABBYY’s reference on PDF types is the clearest plain-English explanation of why. Handwritten material occupies a middle ground where recovery varies sharply by script quality, ink clarity, page age and layout density. Photographs of whiteboards, counterintuitively, are often more recoverable than handwritten notes because marker contrast is high and the background is uniform.
The mix in your own business depends on history. A firm that has been digitally native for fifteen years holds mostly category one. A firm that ran a paper-to-PDF digitisation project five to ten years ago will hold a meaningful chunk of category two, and the contracts the AI ignored are almost certainly in there. Category three only matters if your workflows touch signed pages, intake forms or meeting captures.
Where will you actually meet this problem?
In the gap between what the AI tool claims and what it does on your actual files. Vendors describe their products as accepting PDFs, which is technically true. What they rarely say is that the tool reads the text layer if there is one and silently ignores the file if there is not. The result is a workflow that looks fine on recent documents and quietly fails on the older half.
The pattern shows up sharpest in three places. Legacy contract review, where pre-digitisation files are scans. Compliance and audit work, where older policy documents and signed approvals were filed as images. And document search across the shared drive, where the search tool returns hits from the last few years but appears to think the older folders are empty. None of these are AI failures in the model sense. They are file-format failures the model never had the chance to address.
When should you OCR a document set, and when should you leave it alone?
When three conditions all hold: volume is high enough that the per-document cost is sensible, the AI use case for the documents is clear, and the cost of an extraction error is one you can plan for. GRM Document Management puts basic scanning at roughly seven to twelve pence per page and OCR at one to five pence on top, so a ten-thousand-page contract archive sits in the eight hundred to seventeen hundred pound range.
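The arithmetic is easy to re-run against your own page count: 10,000 pages at 7p plus 1p per page comes to £800 at the bottom of the range, and 10,000 pages at 12p plus 5p comes to £1,700 at the top.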
Volume matters because the fixed setup cost is real. A few dozen scanned files can be handled with the OCR already built into everyday tools, Word’s PDF conversion or macOS Live Text among them, at no incremental cost, and that is the right approach for small jobs. Use case clarity matters because OCR for search (“find the contract with client X”) tolerates lower accuracy than OCR for extraction (“read the renewal date into our pipeline”). Error cost matters because LlamaIndex’s analysis of OCR accuracy shows that 99 per cent character accuracy can still produce field-level mistakes that need human review.
The honest answer for many SMEs is to OCR one or two high-value archives, often the contract back-catalogue or the client correspondence file, and leave the long tail of miscellaneous material alone until a clear use case surfaces.
What does this connect to in the rest of the cluster?
It connects to the wider question of how AI-ready your document estate actually is. This post sits in the documents and unstructured content section of the broader data and knowledge readiness cluster, alongside related pieces on SOPs AI can actually read, the shared drive problem, and naming conventions for AI retrieval.
The underlying argument across all of them is the same. AI projects fail at the data and document layer far more often than at the model layer, and the fix is proportionate triage rather than a wholesale digitisation programme. The work is to identify which document categories you hold, which of them have a clear AI use case attached, and which ones justify the pre-processing effort. The pre-2021 contracts that triggered this piece are a textbook example. Solvable, but only once the file-level diagnosis is right.
If you are looking at a document estate of your own and are not sure how to size the work, that is the kind of scoping conversation worth having before any OCR invoice gets raised. Book a conversation.