Six questions to ask in an AI demo before you sign anything

TL;DR

AI vendor demos run on cherry-picked data, optimised prompts, and a fresh model state, which helps explain why 95 per cent of generative AI pilots fail to reach measurable production impact. Six questions, asked during the demo, surface the gap before you sign. They are not adversarial; they are diagnostic. Production-ready vendors answer them crisply with real examples. Demo-only vendors hedge, defer, or shift the conversation offline.

Key takeaways

- AI demos are systematically misleading even when the vendor is honest, because selection effects, polished prompts and a fresh model state hide the variance you will see on your own data.
- 95 per cent of generative AI pilots fail to deliver measurable profit-and-loss impact, and 88 per cent of proof-of-concept projects never reach production, so evaluation discipline during the demo is your highest-return safeguard.
- The six questions move the conversation from polished example to real data, difficult cases, messy input, your specific edge case, output variance, and observable failure modes.
- A production-ready vendor welcomes these questions and answers with concrete examples, monitoring dashboards and governance thresholds, because customers who probe rigorously are customers who succeed with the product.
- A demo-only vendor hedges, asks to take the data offline, refers to professional services for edge cases, and treats your questions as obstacles to the sale rather than confidence-building opportunities.

An owner is watching an AI tool demo. Every example is suspiciously clean. Every input is well-formed. Every output reads coherently. She wants to ask, “is this actually what I will get on my real data,” without souring the room or sounding like she is trying to catch the vendor out. The question is fair, and there is a way to ask it that keeps the conversation useful.

AI demos are inherently selected. The vendor shows their best example, on their cleanest data, with their cleanest prompt, on a fresh model that has not yet been confused by your edge cases. That is sales doing its job, not deception. The trouble is that the gap between a demo and a production deployment is wider for AI tools than it is for traditional software, and the numbers behind that gap are stark. Six questions, asked during the demo, narrow it.

Why are AI demos systematically misleading even when the vendor is honest?

Every demo starts with a choice about which examples to show, and that choice introduces selection effects that hide the system’s average performance behind its best. MIT’s NANDA initiative found 95 per cent of generative AI pilots fail to deliver measurable profit-and-loss impact, and IDC research with Lenovo found 88 per cent of proof-of-concept projects never reach production. The gap between curated examples and real operating conditions is the bottleneck.

The demo prompts have been refined across dozens of variations. The data has been cleaned. The model is fresh, with no accumulated context from your own messy inputs. The Air Canada case in 2024 made the consequences concrete: a chatbot confidently invented a bereavement fare policy that did not exist, the tribunal ruled the airline liable, and a demonstration that almost certainly passed internal review went on to produce hallucinated legal exposure in production.

Questions one and two: real data, and a difficult example

The first question is, “can you run this on a sample of my own data right now.” Bring fifty anonymised records on a USB stick. A production-ready vendor will say yes, ask for a few minutes to load it, and walk you through both the successful outputs and the cases the system flagged for review. A demo-only vendor will ask to take the sample offline.

Either response can be polite. The first is confident, the second is buying time. If a vendor consistently asks to take things offline, you are seeing the edge of their comfort zone, and that edge tends to be where you will spend the first three months of any deployment.

The second question is, “can you show me a difficult example you have actually encountered, not a clean one.” Production-ready vendors have these cases ready, because they monitor their deployments. They will show you a screenshot of a low-confidence output, explain why the system struggled, and walk you through what happened next. Demo-only vendors will tell you that difficult cases are outside the intended use case, or that customer privacy prevents sharing examples. The honest version of the privacy answer is, “I can show you a de-identified case from a comparable customer.” The evasive version is silence.

Questions three and four: messy input, and your specific edge case

The third question is, “what happens when the input is messy or partial.” Real business data is incomplete. Customer records have empty fields. Supplier files arrive in formats nobody anticipated. A production-ready vendor will explain how the system handles missing data, show you a concrete example, and tell you what threshold of input quality triggers a human review. They will know the answer because they have already had to build for it.
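The threshold-and-review logic described above can be sketched in a few lines. Everything here is illustrative: the field names, the 0.7 completeness cut-off, and the routing labels are invented for this example, not taken from any vendor's product.

```python
# Hypothetical sketch of an input-quality gate: score how complete a record
# is before it reaches the model, and route anything below a threshold to
# human review. All field names and thresholds are invented for illustration.

REQUIRED_FIELDS = ["name", "email", "account_id", "last_payment"]
REVIEW_THRESHOLD = 0.7  # below this fraction of usable fields, a human looks first

def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS
                 if record.get(f) not in (None, "", "N/A"))
    return filled / len(REQUIRED_FIELDS)

def route(record: dict) -> str:
    """Decide whether a record is clean enough for automatic processing."""
    return "auto" if completeness(record) >= REVIEW_THRESHOLD else "human_review"

clean = {"name": "A. Patel", "email": "a@x.com",
         "account_id": "42", "last_payment": "2024-05-01"}
messy = {"name": "A. Patel", "email": "",
         "account_id": None, "last_payment": "N/A"}

print(route(clean))   # auto
print(route(messy))   # human_review
```

A vendor who has built for messy input will be able to describe exactly this kind of gate, including where their threshold sits and what happens to the records that fail it.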

The fourth question is the one that does the most work, and it requires preparation on your part. Bring a specific, difficult example from your own business. If you run lettings, ask how the system would handle a tenant with multiple payment methods and partially waived late fees. If you run a recruitment desk, ask how it would rank a candidate whose experience does not match any standard job title. The point of specificity is that vendor demos use generic examples, which are typical of what the system was trained on. Your atypical cases force the system onto unfamiliar ground, and the vendor’s response will tell you whether they have built for fallback or assumed everything will fit.

Questions five and six: variance, and the failure mode

Ask the fifth question plainly. “Show me what happens when you ask the system the same thing twice in a row.” Language models sample from a probability distribution when they generate text, which means identical prompts can produce different outputs. Thinking Machines Lab’s technical analysis sets out why this happens and how it can be dampened for production use.
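The sampling behaviour behind this question can be shown with a toy next-token distribution. The vocabulary and probabilities below are invented for illustration; real models work over tens of thousands of tokens, but the mechanism is the same.

```python
import random

# Toy illustration of why identical prompts can yield different outputs:
# the model produces a probability distribution over next tokens, and a
# system either samples from it (varied output) or always takes the most
# likely token (deterministic). Vocabulary and probabilities are invented.

next_token_probs = {"refund": 0.5, "credit": 0.3, "voucher": 0.2}

def sample_token(probs: dict, rng: random.Random) -> str:
    """Sampling: draw a token in proportion to its probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

def greedy_token(probs: dict) -> str:
    """Greedy decoding: always pick the highest-probability token."""
    return max(probs, key=probs.get)

rng = random.Random()  # unseeded, like a production endpoint
samples = {sample_token(next_token_probs, rng) for _ in range(50)}
print(samples)                         # usually more than one distinct token
print(greedy_token(next_token_probs))  # always "refund"
```

Fifty identical "prompts" yield a mix of answers under sampling, while greedy decoding returns the same token every time. That is the trade-off a production-ready vendor should be able to explain for your use case.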

A production-ready vendor will know the answer, will have set the system up for your use case, and will show you their monitoring of output variance. A demo-only vendor will be surprised by the question, which is the answer.

The sixth question is the one that matters most after you have signed. “What does failure look like, and how would I notice it in my own operations.” Every AI system fails. The question is whether you find out before the failure becomes a customer issue, a legal exposure, or a viral video. A production-ready vendor will walk you through specific failure modes with thresholds, dashboards, and escalation routes. A demo-only vendor will tell you that errors are rare and shift responsibility to your input data quality. The McDonald’s drive-thru system worked in controlled testing. It failed at 100 locations because the demo had never included background noise, regional accents, or a customer who changes their mind halfway through.
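The kind of threshold-and-alert machinery a production-ready vendor should be able to describe can be sketched as a rolling error-rate monitor. The 10-per-cent alert rate and window size here are illustrative assumptions, not any real product's defaults.

```python
from collections import deque

# Hypothetical sketch of failure-mode monitoring: track recent outputs in a
# rolling window and raise an alert when the rate of flagged outputs crosses
# a limit. Window size and alert rate are invented for illustration.

class FailureMonitor:
    def __init__(self, window: int = 100, alert_rate: float = 0.10):
        self.outcomes = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, flagged: bool) -> bool:
        """Log one output; return True if the rolling flag rate breaches the limit."""
        self.outcomes.append(flagged)
        rate = sum(self.outcomes) / len(self.outcomes)
        # Only alert once the window is full, so early noise is ignored.
        return (len(self.outcomes) == self.outcomes.maxlen
                and rate >= self.alert_rate)

monitor = FailureMonitor(window=10, alert_rate=0.3)
# Simulate a stream where every third output gets flagged (about 33%).
alerts = [monitor.record(flagged=(i % 3 == 0)) for i in range(20)]
print(any(alerts))  # True: the flag rate breaches the 30% limit
```

A vendor with this kind of dashboard can tell you the threshold, the window, and who gets paged. A vendor without it will tell you errors are rare.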

The pattern in the answers

Production-ready vendors and demo-only vendors give different answers in different registers. A production-ready vendor responds crisply, names limitations directly, shows you monitoring infrastructure, and welcomes the probing because customers who ask detailed questions are customers who succeed with the product. A demo-only vendor hedges with “typically” and “usually”, asks to take questions offline, and treats your questions as friction in the sales conversation.

Listen for the register, not just the content. Stanford’s analysis of agent deployments found that 89 per cent of enterprise AI agent implementations never reach production, with median project costs between 150,000 and 800,000 pounds. The vendors who survive that funnel are the ones who have already had the hard conversations about variance, governance, and failure with other buyers. They are not afraid of yours.

If you would like a second pair of eyes on a vendor demo coming up, or a fresh read on a pilot that has stalled, book a conversation.

Sources

- MIT NANDA initiative (2025). The GenAI Divide: State of AI in Business 2025. 95 per cent of generative AI pilots fail to deliver measurable profit-and-loss impact; vendor-purchased tools succeed 67 per cent of the time vs 33 per cent for internally built systems. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
- Stanford HAI (2025). AI Index Report 2025. 233 documented AI safety incidents in 2024, a 56.4 per cent increase over 2023, including the Air Canada chatbot hallucination case. https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts
- Civil Resolution Tribunal of British Columbia (2024). Moffatt v. Air Canada. Tribunal ruled Air Canada liable for chatbot misinformation about a bereavement fare policy that did not exist, ordering a partial refund. https://cloudsecurityalliance.org/blog/2024/06/05/the-risks-of-relying-on-ai-lessons-from-air-canada-s-chatbot-debacle
- Museum of Failure (2024). McDonald's AI Drive-Thru. Documents the IBM-powered ordering system that misheard real customers (nine sweet teas, bacon on ice cream) across 100 locations before McDonald's terminated the partnership in July 2024. https://museumoffailure.com/exhibition/mcdonalds-ai-failure
- The Markup, via Envive case study (2024). New York City MyCity Chatbot. Microsoft Azure-trained municipal chatbot gave systematically illegal advice on tip deduction, voucher discrimination, and cash acceptance. https://www.envive.ai/post/case-study-nycs-mycity-chatbot
- Thinking Machines Lab (2025). Defeating Nondeterminism in LLM Inference. Technical analysis of why language models produce different answers to identical prompts, and how temperature and sampling parameters can be configured for production use. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- CTO Advisor (2026). AI Doesn't Fail in the Demo, It Fails the First Time You Have to Trust It. Practitioner analysis of why demo autonomy feels like magic and production autonomy feels like risk, with governance and predictability framing. https://thectoadvisor.com/blog/2026/03/16/ai-doesnt-fail-in-the-demo-it-fails-the-first-time-you-have-to-trust-it/
- Fisher Phillips (2024). Essential Questions to Ask an AI Vendor Before Deploying Artificial Intelligence. Legal-counsel checklist of diligence questions covering data handling, edge cases, and contractual safeguards for AI buyers. https://www.fisherphillips.com/en/insights/insights/essential-questions-to-ask-ai-vendor-before-deploying-artificial-intelligence
- Palavir (2024). AI Vendor Red Flags. Vendor evaluation framework covering security documentation, pricing predictability, real-data pilot resistance, and the demo-to-production gap. https://palavir.co/blog/ai-vendor-red-flags
- Stanford analysis via AI Consulting Network (2026). AI Index 2026: Agents and the 66 Per Cent Production Gap. 89 per cent of enterprise AI agent implementations never reach production, with median project costs of 150,000 to 800,000 pounds. https://www.theaiconsultingnetwork.com/blog/stanford-ai-index-2026-agents-66-percent-production-gap-cre-investors

Frequently asked questions

Will asking these questions offend the vendor?

No. Good vendors expect them and have answers ready. The questions are vendor-neutral and diagnostic, not adversarial. You are not trying to break the system. You are trying to understand what you would actually be buying. If a vendor reacts defensively to a request to run your own data or show a difficult example, that reaction is itself useful information about how the relationship would run after you sign.

How long should the demo run if I ask all six questions?

Plan ninety minutes rather than the standard thirty. The demo script a vendor typically prepares runs around twenty minutes and is built to surface capability, not edge cases. Adding the six questions adds genuine work for the vendor, including loading your sample data, finding a difficult case from their archive, and walking through monitoring screens. If a vendor will not give you the time, that itself is a signal worth weighting.

What if I do not have a clean sample of my own data ready to share?

Bring something realistic rather than something polished. An anonymised export of fifty customer records, a recent batch of supplier invoices, or a week of support tickets is enough. The point of question one is not statistical validity. It is to see whether the system holds up on input the vendor has not pre-screened. Messy is the feature, not the bug. Strip personal identifiers, hand over the file, and watch what happens.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
