McDonald's IBM drive-thru: the pilot lesson SMEs miss

TL;DR

McDonald's tested IBM's voice-AI drive-thru at just over 100 of its roughly 40,000 locations for about three years, then ended the pilot in July 2024 when order accuracy stayed in the low-to-mid 80% range against an estimated 95% threshold for viability. The discipline worth copying is the explicit decision point and the willingness to stop, not the technology choice itself.

Key takeaways

- McDonald's piloted at roughly a quarter of one per cent of locations, not at scale, before deciding whether to roll out
- The IBM AOT system reportedly ran at low-to-mid 80% order accuracy, against an estimated 95% threshold needed to justify replacing human staff
- The valuable move was stopping in July 2024, not the technology choice in 2019
- SMEs over-extend pilots more often than they over-extend rollouts, because stopping is politically expensive even when it is financially cheap
- Define your stop-or-scale threshold in writing, with a date, before the pilot begins, and agree what would have to be true for you to walk away

She has the budget. The vendor has been encouraging. The in-house enthusiast wants to move. The pitch deck she has been handed says the scaling decisions can be made later, once the pilot is up and running and the team is comfortable with the tool. She is sitting in front of the proposal, and the only thing she is genuinely unsure about is whether to commit to the bigger rollout now or hold the line on a narrower test.

That is the moment the McDonald’s IBM drive-thru story is for.

Most of the coverage treats the McDonald’s pilot as a punchline. Nine sweet teas, bacon on ice cream, $260 of chicken nuggets piling up on a viral video. It is funny, and the comedy is real, but the comedy is not the lesson. The lesson is what McDonald’s did once the numbers came back.

What actually happened with the McDonald’s IBM drive-thru pilot

McDonald’s tested IBM’s Automated Order Taker, a voice-AI system handling drive-thru ordering without a human, across more than 100 restaurants. Testing of automated voice ordering began in 2019, the IBM partnership followed in 2021, and the pilot ran until mid-2024. That is a small slice of the roughly 40,000 sites in the global estate. Reported order-fulfilment accuracy sat in the low-to-mid 80% range against an estimated 95% needed to justify replacing human staff. In July 2024 McDonald’s ended the partnership.

That is the headline arc. A three-year test at roughly a quarter of one per cent of locations. An honest read of the numbers. A clean decision to stop, with a separate statement that voice ordering would still be explored, including with Google Cloud.

The viral order errors got the airtime, but the franchise economics did the actual deciding. At low-to-mid 80s accuracy you are paying for the technology and still paying staff to catch the mistakes. The case for the tool, as priced and built, did not close.
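The arithmetic behind that paragraph can be sketched in a few lines. This is a rough illustration only: the site counts and the 95% threshold are the figures reported above, while the 0.80 and 0.85 accuracy values are an assumed reading of "low-to-mid 80s", not published numbers, and the function names are made up for the example.

```python
# Reported figures from the article. The accuracy bounds are an ASSUMED
# reading of "low-to-mid 80s"; McDonald's did not publish exact numbers.
PILOT_SITES = 100
TOTAL_SITES = 40_000
THRESHOLD_ACCURACY = 0.95
ASSUMED_ACCURACY_RANGE = (0.80, 0.85)


def pilot_share(pilot_sites: int, total_sites: int) -> float:
    """Fraction of the estate exposed to the pilot."""
    return pilot_sites / total_sites


def errors_per_100(accuracy: float) -> float:
    """Orders out of every 100 that a human would still have to catch."""
    return round((1 - accuracy) * 100, 1)


def error_multiple(accuracy: float, threshold: float) -> float:
    """How many times more errors occur than the threshold would allow."""
    return round((1 - accuracy) / (1 - threshold), 1)


if __name__ == "__main__":
    print(f"Pilot share of estate: {pilot_share(PILOT_SITES, TOTAL_SITES):.2%}")
    for accuracy in ASSUMED_ACCURACY_RANGE:
        print(
            f"At {accuracy:.0%} accuracy: {errors_per_100(accuracy)} errors per 100 orders, "
            f"{error_multiple(accuracy, THRESHOLD_ACCURACY)}x the rate allowed at the threshold"
        )
```

Under those assumptions, the gap between the mid 80s and 95% is not ten points of polish. It is three to four times as many mistaken orders as the threshold allows, which is why staff still had to stand behind the system.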

Why the pilot discipline is the real story, not the technology choice

The technology choice is not what an SME owner should read McDonald’s for. Voice AI in noisy environments was always going to be hard, and other operators are running narrower versions of the same idea well. The structure of the test is the part to study. McDonald’s did three things in sequence that many firms get wrong, and doing them kept a difficult voice-AI test from becoming a much larger rollout failure.

They tested at a small enough scale that stopping cost them learning, not the business. They measured against an explicit, economically grounded threshold, accuracy of around 95%, not a vague sense of “is this working.” And they had a decision point with a date attached, so the question of go or stop did not drift through another quarter of vendor optimism. The willingness to walk away in July 2024, three years and a public narrative in, is the move. Rolling the technology out to all 40,000 locations on the strength of vendor confidence is the failure that did not happen, because the pilot was designed to make stopping possible.

Where this lands for an owner-managed firm

For an owner-managed firm running its first AI pilot, the parallel is direct even though the scale is not. The MIT NANDA report on generative AI in 2025 found about 95% of enterprise GenAI pilots stalled, with failures rooted in integration, governance and workflow fit rather than model quality. McKinsey’s 2025 State of AI survey put roughly two-thirds of AI-using organisations still piloting rather than at scale. Stratify Insights’ 2026 benchmark estimated 60 to 80% of AI projects fail to reach production. The pattern is consistent. The technology often works. The pilots frequently do not.

What the McDonald’s case adds is the discipline that prevents a stuck pilot from becoming a stuck rollout. Pilot scope small enough that stopping does not hurt the firm. Success metrics written down before the pilot starts, with the threshold for go and the threshold for stop sitting on the same page. A decision point with a date. A default, if the numbers do not hit threshold, of “stop and learn” rather than “extend the pilot another quarter and hope it matures.” Owner-managed firms over-extend pilots far more often than they over-extend rollouts, because the over-extension feels like prudence at the time.

The harder lesson: why stopping is politically expensive and financially cheap

Stopping costs you little beyond what is already spent, and a great deal in the room. You have to tell the vendor it did not work. You have to absorb whatever you have spent. You have to redirect the in-house enthusiast. You have to sit with the fact that the optimism in the original pitch deck did not survive contact with the numbers. That discomfort is the reason pilots frequently run too long.

The asymmetry is the bit to hold on to. The political cost of stopping is local, immediate and visible. The financial cost of not stopping is downstream, larger, and easier to defer. Sinch’s customer-communications research, reported in 2025, found that around 74% of firms running AI communication agents in production had already rolled back or shut down at least one on governance grounds, and the firms with stronger oversight were the ones doing it more often. The willingness to stop is a maturity signal, not a failure signal. McDonald’s took the politically expensive move in July 2024 and avoided a much larger financial one. Walmart did something similar later when its OpenAI Instant Checkout test came in at roughly a third of website conversion, and OpenAI rolled the feature back. The discipline travels.

What the case does not say, and what to do with it before you commit

The McDonald’s case does not say voice AI does not work. It does not say IBM’s technology was bad. It says the fit between this tool, this environment and this economic model was not viable at scale, and the right response to a wrong fit was stopping. The case is not an argument for caution as a default; it is an argument for clarity about when you will stop.

Apply the same discipline to the AI proposal sitting on your desk. Write down, before the pilot starts, what success looks like in numbers you can measure. Write down the threshold for stop, not just the threshold for go. Put a date on the decision point. Then ask the question the McDonald’s leadership team had to answer in 2024: what would have to be true at the decision point for us to walk away, and is the team in this room willing to do it if that condition is met? If the answer to the second part is no, the pilot is already too big.

If you want to test that discipline on a live AI decision your firm is sitting on, book a conversation.

Sources

- Binance Square reporting (2024). McDonald's ends IBM AI drive-thru pilot in over 100 restaurants amid accuracy concerns. Used as primary source for pilot scope, end-of-partnership date, and the low-to-mid 80s vs 95% accuracy figure. https://www.binance.com/en/square/post/9653239819505
- Museum of Failure entry on the McDonald's IBM AOT partnership. Used for partnership announcement (2019) and pilot timeline. https://museumoffailure.com
- MIT NANDA Initiative (2025). The GenAI Divide: State of AI in Business 2025. Used for the headline finding that about 95% of generative AI pilots stall, with failure rooted in integration and governance rather than model quality. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
- McKinsey (2025). The State of AI in 2025. Used for the finding that nearly nine in ten organisations use AI but only about a third have scaled programmes, with the rest still in piloting. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- Stratify Insights (2026). AI Project Failure Rate Benchmark. Used for the 60-80% production-failure estimate and the framing of AI debt as a governance and stewardship deficit. https://www.stratifyinsights.ai/ai-project-failure-rate
- Sinch / TechRadar reporting (2025). About 74% of organisations have rolled back or shut down at least one AI agent on governance grounds, with mature firms doing it more often. Used to ground the claim that disciplined rollback is a maturity signal, not a failure signal. https://www.techradar.com
- Reuters / CIO commentary (2025). Walmart and OpenAI Instant Checkout reversal. Used as a parallel example of a large operator measuring honestly and stopping when in-chat conversion came in at roughly a third of website conversion. https://www.cio.com/article/4168980/retail-ai-has-a-data-problem-heres-how-to-fix-it.html
- Information Commissioner's Office. Guidance on AI and automated decision-making. Used as the UK-anchored reference for the principle that an operator remains responsible for the systems they deploy, including AI under pilot. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/
- UK National Cyber Security Centre (2023). Annual Review case study, AI in cyber security. Used to anchor the broader pattern that worst-case scenarios should be mapped at pilot design stage, not after capital has been committed. https://www.ncsc.gov.uk/collection/annual-review-2023/technology/case-study-cyber-security-ai

Frequently asked questions

Does McDonald's ending the IBM pilot mean voice AI does not work for drive-thrus?

No. Other operators are running voice AI in narrower contexts, and McDonald's itself said it would keep exploring voice ordering, including with Google Cloud on generative AI. The IBM pilot tells you that voice AI in a noisy, accent-varied environment was not yet reliable enough at McDonald's scale and economics. The reading is about pilot discipline, not a verdict on the technology.

How small should an SME pilot actually be?

Small enough that stopping does not hurt the firm. If walking away from the pilot would force you to defend a sunk cost in front of the board, the pilot was too big. McDonald's used about 100 of roughly 40,000 sites, near a quarter of one per cent. For a smaller firm, that often means one team, one workflow, one quarter, with a budget you would write off without flinching.

What does a written stop-or-scale threshold look like in practice?

One page. Three things on it. The success metric you will measure, the numerical threshold for go and the threshold for stop, and a dated decision point when the team will sit down and read the numbers honestly. McDonald's had a clear economic threshold of around 95% accuracy. Yours might be hours saved per week, error rate, or conversion lift. The discipline is fixing the numbers before the data starts arriving.
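For readers who think better in code than on paper, the one-page charter can be sketched as a small data structure. Treat this as an illustration under stated assumptions, not a template anyone is known to have used: the names `PilotCharter` and `Decision` are invented for the example, the 0.90 stop threshold and the decision date are hypothetical, the metric is assumed to be one where higher is better, and the rule that a result between the two thresholds defaults to stopping follows the "stop and learn" default argued for in this post.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Decision(Enum):
    NOT_YET = "decision point not reached"
    SCALE = "scale"
    STOP = "stop"
    STOP_BY_DEFAULT = "stop and learn (between thresholds)"


@dataclass(frozen=True)
class PilotCharter:
    """Hypothetical one-page charter, fixed before any data arrives."""

    metric: str
    go_threshold: float    # at or above this, scale
    stop_threshold: float  # at or below this, stop outright
    decision_date: date    # the day the numbers are read

    def __post_init__(self) -> None:
        # Assumes a higher-is-better metric, so go must not sit below stop.
        if self.go_threshold < self.stop_threshold:
            raise ValueError("go_threshold must be at least stop_threshold")

    def decide(self, measured: float, today: date) -> Decision:
        if today < self.decision_date:
            return Decision.NOT_YET
        if measured >= self.go_threshold:
            return Decision.SCALE
        if measured <= self.stop_threshold:
            return Decision.STOP
        # Grey zone: default to stopping, not to extending the pilot.
        return Decision.STOP_BY_DEFAULT


if __name__ == "__main__":
    # Illustrative values only: 0.95 is the reported viability threshold;
    # the stop threshold, date and measured accuracy are assumptions.
    charter = PilotCharter("order accuracy", 0.95, 0.90, date(2024, 7, 1))
    print(charter.decide(measured=0.85, today=date(2024, 7, 1)).value)
```

The design choice that matters is the last branch. A result that lands between the two thresholds is the one most likely to be argued into another quarter, so the sketch resolves it to a stop unless someone deliberately rewrites the charter.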

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
