What is structured output (JSON mode)? The death of regex-and-pray

TL;DR

Structured output is a feature on every major LLM that lets you define a schema (usually JSON) and guarantees the model's response will conform to it. The mechanism is constrained decoding, which masks any token that would violate the schema at each step of generation. For SME pipelines feeding invoice extracts, ticket classifications, or CRM records into downstream systems, schema compliance moves from roughly 70-85% to 100%, and the regex-and-pray repair layer disappears.

Key takeaways

- Structured output forces the model's response into a schema you define: which fields, what types, which are mandatory, which enum values are allowed. The mechanism, constrained decoding, makes invalid output mathematically impossible.
- Three production-grade implementations in 2026: OpenAI Structured Outputs with strict mode, Anthropic JSON outputs and strict tool use, and Google Gemini structured output. Each has subtle dialect differences worth knowing before you build.
- The reliability dividend is large. Schema-valid responses parse on the first attempt, retries disappear, and the regex-repair logic that grew over months gets deleted. The cost overhead is real but modest, typically 10 to 30% in tokens across realistic SME workloads.
- Schema validates syntax, not semantics. A schema-valid invoice extraction can still hallucinate the total. For high-stakes use, pair structured output with a deterministic check on the values, ideally with the model quoting the source verbatim alongside the extraction.
- The decision rule is whether AI output flows into a downstream system without human review. If yes, structured output earns its keep on day one. If a human always checks before action, free-form output is fine.

A 19-staff accountancy firm builds an AI invoice-processing pipeline. The first version uses prompt engineering, the standard ask-and-hope approach: “return the invoice details as JSON”. It works on a good day. About 12% of responses fail to parse cleanly because the model wraps the JSON in markdown code fences, returns a trailing comma, or hallucinates a field name that the accounting system does not recognise.

Staff spend a few hours a week cleaning up those failures, and a small number of corrupted line items reach the accounting system before anyone notices. The team rebuilds the pipeline using structured output with a strict JSON schema. Failure rate drops from 12% to under 1%. The cost per query rises by roughly 15% in tokens. The regex-repair logic that had grown over six months gets deleted in a single afternoon. The owner sees the trade plainly: a small token premium for predictable, audit-ready output.

What is structured output?

Structured output is a feature on every major LLM in 2026 that lets you define a schema (usually JSON Schema or a Pydantic model) and guarantees the model’s response will conform. The mechanism, called constrained decoding, masks any token that would violate the schema at each decoding step, so invalid output is mathematically impossible once strict mode is on.
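As a concrete sketch, here is what an invoice schema and a conforming response might look like. The field names and enum values are illustrative assumptions for this post, not a vendor-mandated shape:

```python
import json

# Illustrative invoice schema in JSON Schema form. Field names and the
# currency enum are assumptions for this sketch, not a fixed standard.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["GBP", "EUR", "USD"]},
    },
    "required": ["vendor_name", "invoice_date", "total", "currency"],
    "additionalProperties": False,
}

# With strict mode on, the model can only emit a response shaped like this:
response_text = (
    '{"vendor_name": "Acme Ltd", "invoice_date": "2026-01-15", '
    '"total": 1240.50, "currency": "GBP"}'
)
record = json.loads(response_text)  # plain parser, no repair layer
print(record["total"])  # 1240.5
```

Note that the downstream code deserialises with a standard parser and passes a typed object on, exactly like the rest of a data pipeline.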

Why does the reliability dividend matter?

It matters because parsing failures cascade into operational costs that the first version of an AI pipeline rarely accounts for. An invoice extractor that fails 12% of the time triggers retries, regex repair, and staff review on the residual edge cases. With strict-mode structured output, every required field is present, every numeric field is a number, and every enum value is one you defined.

The dividend compounds when AI steps chain. If a workflow makes ten sequential decisions and each one parses 99% of the time, the probability of completing the chain without a failure drops to about 90%. With schema enforcement at 100% on each step, end-to-end reliability is limited by your business logic, not your parser. For SMEs running automations at meaningful volume (hundreds of invoices a day, thousands of tickets a month), that shift from 90% to 99% system reliability translates into fewer escalations, less rework, and more predictable cost.
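The chain arithmetic above is worth making explicit:

```python
# Ten chained AI steps, each parsing successfully 99% of the time.
per_step = 0.99
steps = 10
chain_success = per_step ** steps
print(f"{chain_success:.3f}")  # 0.904: roughly 1 in 10 runs hits a parse failure

# With schema enforcement, parse success is 1.0 per step, so end-to-end
# reliability is capped by your business logic, not your parser.
```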

Where will you actually meet it?

You will meet structured output in three SME patterns where it has immediate business value. The first is document data extraction: vendor name, invoice date, line items, and total, parsed directly into your accounting system. The second is support ticket classification, with category and priority returned as enums your routing logic recognises. The third is CRM-ready records, extracted from inbound emails or web forms in the exact shape your CRM expects.

In each case the AI output flows into a downstream system without human review. Schema enforcement removes an entire failure class. Your code stops carrying a “validate, retry, repair, escalate” branch and starts looking like the rest of your data pipeline: deserialise the response, pass the typed object to the next step.

The 2026 vendor surface is worth knowing before you build. OpenAI requires every field to be marked required (optional fields are simulated by allowing null) and supports a JSON Schema subset. Anthropic supports nullable fields more flexibly using {"type": ["string", "null"]}, and its strict tool use qualifies for Zero Data Retention with limited technical retention, which matters in healthcare and finance. Google Gemini requires explicit propertyOrdering in the schema and supports a slightly different JSON Schema subset. Each dialect has subtle differences, so a schema written for one provider may need a small rewrite to run on another.
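To make the dialect differences concrete, here is one optional field sketched in each provider's style, following the vendor documentation cited in the sources below; confirm exact key names against the current docs before building:

```python
# The same optional "po_number" field expressed in each provider's dialect.
# These shapes follow the vendor docs; treat them as a sketch to verify,
# not a drop-in schema.

# OpenAI strict mode: every field must appear in "required"; optionality
# is modelled by allowing null in the type union.
openai_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "po_number": {"type": ["string", "null"]},
    },
    "required": ["vendor_name", "po_number"],
    "additionalProperties": False,
}

# Anthropic: the same type-array form for nullable fields, without the
# rule that every property must be listed in "required".
anthropic_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "po_number": {"type": ["string", "null"]},
    },
    "required": ["vendor_name"],
}

# Gemini: explicit propertyOrdering controls field order in the response.
gemini_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "po_number": {"type": "string"},
    },
    "propertyOrdering": ["vendor_name", "po_number"],
}
```

The practical consequence is the one the paragraph above names: a schema written for one provider usually needs a small, mechanical rewrite to run on another.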

When does it earn its keep, and when is it overkill?

Enable structured output when the AI response flows into another system without human review, when the schema has more than two or three fields, when you process more than a few hundred items a month, or when an error is expensive to fix downstream. A misclassified invoice that corrupts your ledger costs more in correction time than schema enforcement costs in tokens for a year. The decision is operational, not technical.

Skip it when the output is a single yes-or-no, when a human always reviews before action, when the work is exploratory and the schema is still being discovered, or when the volume is too low for the failure rate to add up. On a one-token classification, the JSON wrapper can be a 5x to 10x multiplier on the response. On a 500-token document extraction, the same wrapper is 2 to 3% of cost. The honest answer is that the overhead is context-dependent, and across realistic SME payloads it lands around 10 to 30%, well inside what the elimination of retries and rework pays back.

The cost story is also where the regex-and-pray pattern dies properly. Before structured output, teams prompted the model to “return JSON” and then wrote regular expressions to extract and repair the result. The pattern worked 70 to 85% of the time and accumulated edge cases for months: trailing commas, single quotes, code-fence wrappers, special characters in field values. Each new failure required a developer to update the regex and redeploy. Schema enforcement at the model level deletes that whole layer. You use a standard JSON parser (built into every language) and move on. One mid-market invoice automation reported a 387% first-year ROI on the switch, driven mostly by labour savings on data entry, error correction, and approval cycles.
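A minimal before-and-after sketch of the layer that strict mode deletes; the regex rules here are illustrative and cover only two of the failure modes named above:

```python
import json
import re

# The old repair layer: strip code fences and trailing commas before
# parsing. This is the pattern that accumulated edge cases for months.
def regex_and_pray(raw: str) -> dict:
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())  # code fences
    raw = re.sub(r",\s*([}\]])", r"\1", raw)                    # trailing commas
    return json.loads(raw)  # still fails on single quotes, bad field names, ...

messy = '```json\n{"category": "billing", "priority": "high",}\n```'
print(regex_and_pray(messy))

# With strict-mode structured output the response is valid JSON by
# construction, and the whole function above gets deleted:
clean = '{"category": "billing", "priority": "high"}'
print(json.loads(clean))
```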

Function calling (also called tool use) is structurally adjacent. The model returns parameters that match a function signature you defined, but the purpose differs. Function calling is for agent workflows where the model chooses which action to take, while structured output is for workflows where you specify the format and the model fills in the data. The function-calling explainer covers the agent-side cases.

Semantic validation sits one layer above. Schema validates syntax (is this valid JSON?) and shape (do the types match, is the enum value in range?). Semantics asks whether the values are correct: does the end date come after the start date, does the total match the line items, is the cited line actually in the source document? Structured output cannot guarantee semantic correctness on its own. For high-stakes extraction, layer a deterministic check on the values, ideally one that requires the model to quote the source verbatim alongside its extraction so a downstream rule can confirm the figure is actually present.
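A sketch of what such a deterministic check might look like, assuming the prompt asks the model to return a quoted source line alongside the extracted figure (both field names are hypothetical):

```python
# A deterministic semantic check layered on top of a schema-valid
# extraction. The "total_source_quote" field is a hypothetical name for
# the verbatim source line the prompt asks the model to return.

def verify_total(extraction: dict, source_document: str) -> bool:
    quote = extraction["total_source_quote"]
    # 1. The quoted line must actually appear in the document.
    if quote not in source_document:
        return False
    # 2. The extracted figure must appear inside the quoted line
    #    (thousands separators stripped for the comparison).
    return str(extraction["total"]) in quote.replace(",", "")

doc = "Invoice 1042\nSubtotal: 1,033.75\nVAT: 206.75\nTotal due: 1,240.50\n"
good = {"total": 1240.50, "total_source_quote": "Total due: 1,240.50"}
bad = {"total": 1299.00, "total_source_quote": "Total due: 1,240.50"}
print(verify_total(good, doc))  # True
print(verify_total(bad, doc))   # False
```

The check is deliberately dumb: no model in the loop, just string containment, which is what makes it a reliable backstop for a figure the model might have invented.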

Hallucinations inside structured output are harder to spot than hallucinations in free-form text, not easier. Without a schema, a fabricated total looks like marketing copy and a human reviewer often catches it. Inside a schema, the same fabrication arrives as a valid number in the right field with a plausible confidence score, and passes silently through to the ledger. Pair schema enforcement with sampling-based audits, especially in the first month after going live, and pair it with prompt instructions that allow null for absent fields rather than encouraging the model to invent.

The honest framing for any owner sitting opposite a vendor pitch on AI data pipelines is this. Ask whether the output is schema-enforced or just hopefully JSON. If the answer is hopefully JSON, expect the regex-repair layer and the 70-85% parse rate that goes with it. If the answer is schema-enforced strict mode, ask which provider, which JSON Schema subset, and how the team handles edge cases when the model gets confused. The vendor that can answer those three questions in a sentence each is the one whose pipeline will still be running cleanly a year from now.

Sources

OpenAI (2024). Introducing Structured Outputs in the API. The launch post for strict-mode JSON Schema enforcement on the API. https://openai.com/index/introducing-structured-outputs-in-the-api/

OpenAI (2026). Structured Outputs guide. The current developer reference for JSON Schema support, strict-mode rules, and required-field semantics. https://developers.openai.com/api/docs/guides/structured-outputs

Anthropic (2026). Claude structured outputs documentation. The reference for JSON outputs and strict tool use, including nullable-field handling and Zero Data Retention eligibility. https://platform.claude.com/docs/en/build-with-claude/structured-outputs

Google (2026). Gemini API structured output. The reference for response_schema, propertyOrdering, and the JSON Schema subset Gemini supports. https://ai.google.dev/gemini-api/docs/structured-output

Cleanlab (2025). Benchmarking the reliability of structured outputs. Independent benchmarking of schema compliance and value-level accuracy across providers. https://cleanlab.ai/blog/tlm-structured-outputs-benchmark/

arXiv (2025). Constrained decoding for structured generation in language models. Peer-reviewed reference on the masking mechanism that makes invalid output mathematically impossible in strict mode. https://arxiv.org/html/2501.10868v1

Wiegold (2025). Building reliable invoice extraction prompts. A working invoice-extraction reference grounded in real schemas and failure modes. https://thomas-wiegold.com/blog/building-reliable-invoice-extraction-prompts/

AWS Machine Learning Blog (2024). Generate structured output from LLMs with dottxt outlines. AWS-published reference on constrained-generation patterns and schema-enforced JSON in production. https://aws.amazon.com/blogs/machine-learning/generate-structured-output-from-llms-with-dottxt-outlines-in-aws/

Vellum (2025). When to use function calling, structured outputs, or JSON mode. Practitioner comparison of the three patterns and when each is appropriate. https://www.vellum.ai/blog/when-should-i-use-function-calling-structured-outputs-or-json-mode

Invoice Data Extraction (2025). Invoice automation ROI guide. Mid-market case data showing 387% first-year ROI on schema-enforced invoice automation. https://invoicedataextraction.com/blog/invoice-automation-roi-guide

Frequently asked questions

Does structured output stop the model hallucinating?

No. It guarantees the response is valid JSON and conforms to the schema you defined, so syntax and types are watertight. It does not check whether the values are true. A schema-valid invoice extraction can still return a fabricated total. For high-stakes use, layer a semantic check on top, for example requiring the model to quote the source line verbatim alongside the extracted figure so a downstream rule can confirm it actually appears in the document.

How much extra does it cost in tokens?

It depends on the shape of the response. On a single binary classification (positive or negative), the JSON wrapper can be a 5x to 10x multiplier on a tiny answer. On a 500-token document extraction with eight fields, the wrapper is roughly 2 to 3% of the output cost. Across realistic SME workloads with meaningful payloads, the average overhead lands at 10 to 30% in tokens, almost always offset by removing retries, regex maintenance, and manual rework.

When is structured output overkill?

When a human always reviews the output before action, when the work is exploratory and the schema is still being discovered, when the answer is a single yes-or-no on a tiny input, or when the volume is too low for parsing failures to add up. For one-line sentiment classification at a few hundred items a month, free-form output with defensive parsing is fine. The case for schema enforcement gets stronger as volume rises and as the output flows into another system unattended.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
