An owner is watching an AI tool demo. Every example is suspiciously clean. Every input is well-formed. Every output reads coherently. She wants to ask, “is this actually what I will get on my real data,” without souring the room or sounding like she is trying to catch the vendor out. The question is fair, and there is a way to ask it that keeps the conversation useful.
AI demos are inherently selected. The vendor shows their best example, on their cleanest data, with their cleanest prompt, on a fresh model that has not yet been confused by your edge cases. That is sales doing its job, not deception. The trouble is that the gap between a demo and a production deployment is wider for AI tools than it is for traditional software, and the numbers behind that gap are stark. Six questions, asked during the demo, narrow it.
Why are AI demos systematically misleading even when the vendor is honest?
Every demo starts with a choice about which examples to show, and that choice introduces selection effects that hide the system’s average performance behind its best. MIT’s NANDA initiative found 95 per cent of generative AI pilots fail to deliver measurable profit-and-loss impact, and IDC research with Lenovo found 88 per cent of proof-of-concept projects never reach production. The gap between curated examples and real operating conditions is the bottleneck.
The demo prompts have been refined across dozens of variations. The data has been cleaned. The model is fresh, with no accumulated context from your own messy inputs. The Air Canada case in 2024 made the consequences concrete: a chatbot confidently invented a bereavement fare policy that did not exist, the tribunal ruled the airline liable, and a demonstration that almost certainly passed internal review went on to produce hallucinated legal exposure in production.
Questions one and two: real data, and a difficult example
The first question is, “can you run this on a sample of my own data right now.” Bring fifty anonymised records on a USB stick. A production-ready vendor will say yes, ask for a few minutes to load it, and walk you through both the successful outputs and the cases the system flagged for review. A demo-only vendor will ask to take the sample offline.
Either response can be polite. The first is confident, the second is buying time. If a vendor consistently asks to take things offline, you are seeing the edge of their comfort zone, and that edge tends to be where you will spend the first three months of any deployment.
The second question is, “can you show me a difficult example you have actually encountered, not a clean one.” Production-ready vendors have these cases ready, because they monitor their deployments. They will show you a screenshot of a low-confidence output, explain why the system struggled, and walk you through what happened next. Demo-only vendors will tell you that difficult cases are outside the intended use case, or that customer privacy prevents sharing examples. The honest version of the privacy answer is, “I can show you a de-identified case from a comparable customer.” The evasive version is silence.
Questions three and four: messy input, and your specific edge case
The third question is, “what happens when the input is messy or partial.” Real business data is incomplete. Customer records have empty fields. Supplier files arrive in formats nobody anticipated. A production-ready vendor will explain how the system handles missing data, show you a concrete example, and tell you what threshold of input quality triggers a human review. They will know the answer because they have already had to build for it.
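To make the threshold idea concrete, here is a minimal sketch of the kind of gate a production-ready vendor should be able to describe. The field names and the 0.8 cutoff are hypothetical, chosen only to illustrate the mechanism, not taken from any vendor’s product.

```python
# A minimal sketch of an input-quality gate: records below a completeness
# threshold are routed to human review rather than processed automatically.
# Field names and the 0.8 threshold are illustrative assumptions.

REQUIRED_FIELDS = ["name", "email", "account_id", "last_payment_date"]
COMPLETENESS_THRESHOLD = 0.8  # below this, a human looks first

def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return filled / len(REQUIRED_FIELDS)

def route(record: dict) -> str:
    """Decide whether a record is processed automatically or reviewed."""
    if completeness(record) < COMPLETENESS_THRESHOLD:
        return "human_review"
    return "automatic"

# A partial customer record: two of the four required fields are empty.
print(route({"name": "A. Tenant", "email": "",
             "account_id": "X1", "last_payment_date": None}))
# -> human_review
```

A vendor who has built for messy data will be able to name their equivalent of that threshold without hesitating.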
The fourth question is the one that does the most work, and it requires preparation on your part. Bring a specific, difficult example from your own business. If you run lettings, ask how the system would handle a tenant with multiple payment methods and partially waived late fees. If you run a recruitment desk, ask how it would rank a candidate whose experience does not match any standard job title. The point of specificity is that vendor demos use generic examples, which are typical of what the system was trained on. Your atypical cases force the system onto unfamiliar ground, and the vendor’s response will tell you whether they have built for fallback or assumed everything will fit.
Questions five and six: variance, and the failure mode
Ask the fifth question plainly. “Show me what happens when you ask the system the same thing twice in a row.” Language models sample from a probability distribution when they generate text, which means identical prompts can produce different outputs. Thinking Machines Lab’s technical analysis sets out why this happens and how it can be dampened for production use.
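For readers who want the mechanics, here is a toy sketch of that sampling step. The tokens and probabilities are invented purely for illustration; the point is that the same input can come back with a different answer, and that lowering the temperature collapses the variance.

```python
# Toy illustration of sampled generation: the next token is drawn from
# softmax(logits / temperature). At temperature 1.0, identical inputs can
# give different outputs; near zero, the choice becomes deterministic.
# Tokens and logit values below are made up for illustration.

import math
import random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    """Sample one token from softmax(logits / temperature)."""
    if temperature <= 0:
        return max(logits, key=logits.get)  # greedy: always the same token
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    weights = {t: math.exp(v - m) for t, v in scaled.items()}
    r = random.uniform(0, sum(weights.values()))
    for token, w in weights.items():
        r -= w
        if r <= 0:
            return token
    return token

logits = {"approved": 2.0, "pending": 1.5, "rejected": 0.5}

# The same "prompt", asked five times at temperature 1.0: answers vary.
print([sample_next_token(logits, 1.0) for _ in range(5)])
# The same prompt at temperature 0: the same answer every time.
print([sample_next_token(logits, 0.0) for _ in range(5)])
```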
A production-ready vendor will know the answer, will have set the system up for your use case, and will show you their monitoring of output variance. A demo-only vendor will be surprised by the question, which is the answer.
The sixth question is the one that matters most after you have signed. “What does failure look like, and how would I notice it in my own operations.” Every AI system fails. The question is whether you find out before the failure becomes a customer issue, a legal exposure, or a viral video. A production-ready vendor will walk you through specific failure modes with thresholds, dashboards, and escalation routes. A demo-only vendor will tell you that errors are rare and shift responsibility to your input data quality. The McDonald’s drive-thru system worked in controlled testing. It failed at 100 locations because the demo had never included background noise, regional accents, or a customer who changes their mind halfway through.
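As a rough illustration of what “thresholds, dashboards, and escalation routes” means in practice, here is a minimal sketch of flagged-output monitoring. The confidence cutoff, alert rate, and window size are assumptions chosen for illustration, not anyone’s product defaults.

```python
# A minimal sketch of production monitoring: track how often outputs fall
# below a confidence threshold, and escalate when the rate over a recent
# window drifts too high. All three numbers are illustrative assumptions.

from collections import deque

LOW_CONFIDENCE = 0.7  # outputs below this confidence are flagged
ALERT_RATE = 0.15     # escalate if over 15% of recent outputs are flagged
WINDOW = 200          # how many recent outputs to watch

recent = deque(maxlen=WINDOW)

def record_output(confidence: float) -> bool:
    """Log one output's confidence; return True if escalation is due."""
    recent.append(confidence < LOW_CONFIDENCE)
    flagged_rate = sum(recent) / len(recent)
    return len(recent) == WINDOW and flagged_rate > ALERT_RATE
```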
The pattern in the answers
Production-ready vendors and demo-only vendors give different answers in different registers. A production-ready vendor responds crisply, names limitations directly, shows you monitoring infrastructure, and welcomes the probing because customers who ask detailed questions are customers who succeed with the product. A demo-only vendor hedges with “typically” and “usually”, asks to take questions offline, and treats your questions as friction in the sales conversation.
Listen for the register, not just the content. Stanford’s analysis of agent deployments found that 89 per cent of enterprise AI agent implementations never reach production, with typical project costs between 150,000 and 800,000 pounds. The vendors who survive that funnel are the ones who have already had the hard conversations about variance, governance, and failure with other buyers. They are not afraid of yours.
If you would like a second pair of eyes on a vendor demo coming up, or a fresh read on a pilot that has stalled, book a conversation.