The owner asks the AI assistant to onboard a new client using the standard process. Twenty minutes later she comes back to a workflow that looks plausible, contains four steps that do not exist in her business, and skips the credit check entirely. The SOP it read was the one her operations manager wrote eighteen months ago. Three new starters have used it without incident. They knew what “verify the client” meant in her firm. The AI did not, so it invented a method that sounded professional and was wrong.
The AI was working from what was actually written down, and nothing more. The fix is small: three changes to format and one change to how completeness is checked, and the result reads better for humans too.
Why do AI tools struggle with human-written SOPs?
Human SOPs rest on a foundation of shared context that is almost never on the page. A new starter brings accumulated knowledge of the business and the customers, and fills hundreds of small gaps daily because they have been in the room long enough to absorb how the work runs. An AI system reads only what is written, and when it meets a gap, it infers and proceeds.
Researchers at Suffolk University describe this behaviour as confabulation rather than hallucination. The model fills missing context with plausible-sounding inference. The output sounds confident and is sometimes invented. Virtasant’s 2024 analysis found that only 6 per cent of enterprises fully trust AI agents on core processes, and 84 per cent of those surveyed pointed at inadequate workflow documentation as the root cause. Many owners assume the documentation gap is smaller than it actually is, because new starters have been quietly filling it for years.
What are the three format changes that make an SOP AI-readable?
An SOP that AI can read reliably is almost always one that works better for humans too. The substance of the procedure does not change. The gaps where human SOPs rely on judgement are concentrated in three places, and making those gaps explicit improves the document for both audiences. It is the same discipline taken slightly more seriously, not extra work.
The first change is explicit decision points. Conditional language like “if the customer is difficult, escalate” or “if the order is large, check stock first” prompts a human to think and leaves an AI guessing. AWS’s work on agent SOPs applies the constraint language from IETF’s RFC 2119, the MUST, SHOULD, and MAY keywords used in technical specifications, to make required and permitted actions unambiguous. “If the customer response indicates dissatisfaction or raises a concern not listed in section 4, the agent MUST escalate to the customer service manager and MUST NOT attempt resolution independently” reads as cleanly to a new starter as to a model.
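As a minimal sketch, the same escalation rule can be expressed as structured data so the RFC 2119 keyword travels with the action it governs; the field names and the dictionary shape are illustrative assumptions, not a prescribed format.

```python
# A minimal sketch (not a prescribed format) of the escalation rule above,
# expressed as structured data so the RFC 2119 keyword travels with the action.
escalation_rule = {
    "condition": (
        "customer response indicates dissatisfaction or raises a concern "
        "not listed in section 4"
    ),
    "MUST": "escalate to the customer service manager",
    "MUST NOT": "attempt resolution independently",
}
```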
The second change is named inputs and outputs at every step. “Process the refund” assumes the reader knows to find the order number, payment method, and customer email. An AI has to be told. Each step should state what information is required to begin (inputs: customer ID, order number, refund reason), what system state is assumed (the order has been marked returned in the warehouse system), and what the step produces (outputs: refund processed in Stripe, confirmation email queued, order record updated). Naming the inputs and outputs also catches steps that depend on information the previous step did not produce, which is the commonest failure mode in real procedures.
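A minimal sketch of what one step looks like once its inputs, assumed state, and outputs are named. The SOPStep class, the field names, and the unmet_inputs helper are illustrative assumptions rather than a required schema; the refund details mirror the example above.

```python
from dataclasses import dataclass

@dataclass
class SOPStep:
    """One procedure step with its inputs, assumed state, and outputs named."""
    name: str
    inputs: list[str]          # information required before the step can begin
    preconditions: list[str]   # system state the step assumes
    outputs: list[str]         # what the step produces

# Illustrative refund step, mirroring the example in the text.
process_refund = SOPStep(
    name="Process the refund",
    inputs=["customer ID", "order number", "refund reason"],
    preconditions=["order marked as returned in the warehouse system"],
    outputs=[
        "refund processed in Stripe",
        "confirmation email queued",
        "order record updated",
    ],
)

def unmet_inputs(step: SOPStep, produced_so_far: set[str]) -> list[str]:
    """Inputs this step needs that no earlier step has produced."""
    return [item for item in step.inputs if item not in produced_so_far]
```

Checking each step's inputs against what earlier steps produce is what surfaces the hidden dependencies mentioned above.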
The third change is a defined terms sidebar. SOPs refer to “the standard response”, “our normal practice”, or “next working day” without saying what those mean in your business. A human in the firm knows. An AI does not, and it will reach for its training data instead, which may or may not match. The sidebar does not need to be long. It needs to say what counts as a “large order”, what the threshold is for “urgent”, and whether “next working day” excludes only weekends or also bank holidays. The UK Government Data Quality Framework treats clear definitions of terminology as foundational, not optional.
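As a sketch, the sidebar can be as simple as a short glossary kept alongside the SOP; every term and threshold below is an example to be replaced with your firm's own definitions.

```python
# An illustrative defined-terms sidebar expressed as a plain glossary.
# Every term and threshold here is an example; substitute your firm's own.
DEFINED_TERMS = {
    "large order": "order value over £5,000 before VAT",
    "urgent": "customer has asked for delivery within two working days",
    "next working day": "excludes weekends and UK bank holidays",
    "standard response": "the current template in the shared templates folder",
}
```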
How do you tell if your SOP actually works before an AI fails on it?
The cheapest validation is the competence test. Hand the SOP to someone who knows the kind of work but not your specific processes. Ask them to follow it step by step without asking for help. Watch where they pause. The pauses reveal gaps; the questions reveal ambiguities. Some of those gaps will not bother a human reader, because they will ask a colleague. The same gaps are failure points for AI.
The method aligns with how regulators think about validation. The FDA’s framework for method validation rests on demonstrating, with objective evidence, that a trained operator can follow the procedure as written and produce a consistent result. The operator can be human or machine, and the test is whether the procedure can be executed without invention. A second use of the test is catching dependencies between steps that the document never states. “Generate an invoice” might assume the order has been confirmed and stock allocated. If those dependencies are not written down, an AI might attempt the steps in parallel and produce an invalid invoice. Naming the input dependencies makes the constraint explicit.
Where does retrieval discipline come in?
Writing an AI-readable SOP solves only half the problem. The other half is making sure the AI can find it, identify the current version, and know whether it has been updated. The Information Commissioner’s Office guidance for small businesses names the failure mode directly: information that cannot be found is worse than information that does not exist. Three disciplines close the gap.
The first is a single source location: one place that holds the authoritative version, whether a Notion workspace, a Confluence space, a Google Drive folder, or a dedicated SOP tool. Other systems may cache a copy or link to it, but they are not the source of truth. IBM’s framing of a system of record versus a source of truth applies cleanly. The second is a stable naming convention. Princeton’s records management guidance recommends a pattern like function-process-version-date, for example “Sales_NewClientOnboarding_v03_20260320”, consistently applied. The third is versioned updates with a visible change log, so an AI tool that has cached an older version can check whether it has been superseded and reload if it has.
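As a sketch of the retrieval side, a naming convention like this can be checked mechanically and the latest version selected automatically. The regular expression and the latest_version helper are illustrative assumptions under the pattern described above, not part of any of the cited guidance.

```python
import re

# Illustrative check for the function-process-version-date convention above.
# The pattern, field lengths, and file names are assumptions, not a standard.
NAME_PATTERN = re.compile(
    r"^(?P<function>[A-Za-z]+)_"
    r"(?P<process>[A-Za-z0-9]+)_"
    r"v(?P<version>\d{2})_"
    r"(?P<date>\d{8})$"
)

def latest_version(filenames: list[str]) -> str | None:
    """Return the highest-versioned file name that matches the convention."""
    matched = [(m, name) for name in filenames if (m := NAME_PATTERN.match(name))]
    if not matched:
        return None
    return max(matched, key=lambda pair: int(pair[0]["version"]))[1]

print(latest_version([
    "Sales_NewClientOnboarding_v02_20250901",
    "Sales_NewClientOnboarding_v03_20260320",
]))  # prints the v03 file name
```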
When is this worth doing and when is it not?
Not every SOP needs to be AI-readable. The upgrade carries a real upfront cost. For procedures that run rarely, change often, or involve heavy judgement where AI was never going to be in the loop, the investment is not justified. For procedures that run frequently, follow rules, and have consequences when they go wrong, it almost always is. The honest question is whether an AI tool will be asked to read this SOP in the next twelve months.
The working rule is to start with three to five procedures that are high volume, repeatable, and rule-based, and where consistency matters. New client onboarding, refund processing, meeting scheduling, invoice creation, and standard customer questions are the usual candidates. Forrester’s research on strategic AI readiness recommends the same shape: start small, scale smart, and embed governance by design. Virtasant’s analysis flags the opposite failure mode: organisations that try to automate too broadly and end up with sprawl. For an owner-operated firm the trap is the same in miniature: doing twenty SOPs badly rather than five well.
The proportionate first move is one procedure, two hours, the three format changes, and the competence test. If that pays off in faster execution or fewer corrections, take the next one. The data side of the same picture is covered in “Why AI projects fail at data, not at AI” and “The one-week data and knowledge audit”; the post on capturing tacit knowledge before key people leave sits alongside this one. If you would rather work through your top three SOPs with someone who has helped other owner-managed firms do it, book a conversation.



