Reference checks for AI vendors: what to ask and how to verify

[Image: a firm owner at her desk in late afternoon, phone to her ear, taking handwritten notes against a printed list of reference questions, three vendor contact names in a notebook beside her]
TL;DR

Default AI vendor reference checks fail because the vendor picks the references and the references say nice things. The fix is to treat the call as one buyer talking peer-to-peer to another buyer who has lived through a decision you are about to make. Five questions produce real information: what surprised you, what would you change about scoping, where did actual costs land, how did they handle problems, would you renew. Verify beyond supplied names through LinkedIn, professional communities, software review aggregators, and named customers in vendor case studies.

Key takeaways

- Vendor-supplied references are a curated sample of the vendor's happiest customers. Asking better questions of biased sources still leaves you with biased sources. The fix is a different set of questions paired with a different set of sources.
- The peer-to-peer reframe is the single highest-value move. You are not calling a stranger to ask them to criticise a vendor. You are a buyer learning from another buyer who has lived through what you are about to live through. The social dynamic that produces bland answers shifts when the frame shifts.
- Five questions produce real information: what surprised you that the vendor did not warn you about, what would you change about your scoping, where did the actual costs land versus the proposal, how have they handled it when something went wrong, and would you renew or switch when your contract ends.
- Verification beyond supplied references uses four independent channels: LinkedIn for non-supplied customers, professional community discussion on Slack, Discord, Reddit and trade forums, software review aggregators like G2 and TrustRadius, and cold outreach to customers named in the vendor's own case studies.
- The output is patterns rather than verdicts. If five sources volunteer the same surprise, you have a vendor communication gap to investigate. If reference answers are uniformly bland generalities, that thinness is itself a signal worth weighting.

The owner of a 14-person professional services firm is forty-eight hours from signing a forty-thousand-pound twelve-month AI engagement. The vendor has given her three reference contacts. She has booked the calls. She has no script beyond “are you happy with them”, and a slightly sinking feeling that she is about to spend ninety minutes on the phone confirming what the vendor already told her.

That feeling is correct. Reference checks done the default way produce almost no information. The vendor selects the references. The references like the vendor or they would not still be customers. They have invested time, money and reputation in the decision, and they are not about to sit on a call with a stranger and concede they were misled. Asking better questions of biased sources does not solve the problem, because the sources are still biased. The fix is a different set of questions paired with a different set of sources.

What is a reference check that actually works?

A reference check that actually works is one buyer talking peer-to-peer to another buyer about a decision the second buyer has already lived through. It uses a small set of specific questions designed to surface gaps between the vendor’s pitch and the customer’s actual experience, and it triangulates the vendor-supplied references against independent contacts found through LinkedIn, community channels, review aggregators, and the vendor’s own published case studies.

The frame matters because it removes the social friction that makes the default conversation produce bland answers. You are not asking a stranger to criticise a supplier. You are a buyer learning from another buyer about scoping, costs, surprises and problem resolution. The vendor’s three-name list becomes one input in a wider picture, not the entire picture.

Why does it matter for your business?

It matters because the default approach has a fixed cost and a hidden one. The fixed cost is the time spent on calls that confirm the pitch. The hidden cost is the false sense of due diligence that licenses you to sign. Forrester’s work on B2B vendor evaluation and Gartner’s Voice of the Customer methodology both find that generic satisfaction questions correlate poorly with deployment success, renewal and actual customer experience.

The hidden costs that real reference work surfaces are also commercially material. BCG's 2024 AI procurement survey and Deloitte's State of Generative AI in the Enterprise both report common patterns of timeline slippage, underestimated data preparation, higher than expected staff retraining and integration complexity that did not appear in vendor proposals. The MIT Sloan piece on evaluating AI vendors makes the same point more bluntly. A buyer who has not surfaced these patterns before signing has agreed to absorb them.

What five questions produce real information?

Five questions, in this order, produce more information in three calls than ten calls of generic satisfaction probing. They work because they ask for descriptive experience rather than verdicts, and because their phrasing makes it socially comfortable to answer honestly rather than diplomatically.

The first question is what surprised you that the vendor did not warn you about. The phrasing is generous to the vendor and forensic about gaps in communication. The respondent can describe a longer implementation, a deeper data cleansing burden, or a heavier internal change-management load without feeling they are attacking the vendor.

The second is what would you change about your scoping if you started again. This shifts the locus of evaluation from “did the vendor mislead us” to “what would we do differently as buyers”, which is far less threatening and produces more honest reflection. Common answers cluster around tighter requirements before vendor engagement, harder negotiation on implementation support, and earlier involvement of end-users.

The third is where did the actual costs land versus the proposal. Be specific about categories: software licence, implementation services, customisation, integration, training, ongoing support, internal staff time. Vendr's practitioner research on SaaS total cost of ownership and the BCG AI procurement work both quantify the typical gap between proposed and actual cost.
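A purely illustrative worked example, with invented numbers, of how that gap typically opens up. Suppose the proposal quotes £40,000: £28,000 for the licence and £12,000 for implementation. The actual first-year spend a reference describes might look like this:

- Software licence: £28,000, as proposed
- Implementation services: £17,000 after two change orders, against £12,000 proposed
- Integration with the existing CRM: £6,000, not in the proposal
- Data preparation and cleansing: £5,000 of contractor time, not in the proposal
- Training and diverted internal staff time: £7,000, not in the proposal

That is roughly £63,000 against a £40,000 proposal, an overrun of nearly sixty percent, and every line after the first is exactly what question three is designed to surface.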

The fourth is how have they handled it when something went wrong. Ask for a specific incident. Listen for how the vendor took ownership, the tenor of the communication, and whether the resolution fixed the root cause or only the symptom. The fifth is whether they would renew or switch when their contract ends, and why. Hedged renewals (“we probably will, but we’re looking at alternatives”) are weaker than confident ones and are themselves diagnostic.

How do you verify beyond the supplied references?

You verify by going to four independent channels, each with a different selection bias from the vendor’s. LinkedIn lets you search for the vendor name in employees’ work histories and reach out to customers the vendor did not put forward. A short, transparent message that names what you are evaluating and asks one specific question gets a surprisingly high response rate, especially from people with strong views.
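A sketch of what that message can look like, with placeholder names rather than a script to copy verbatim: “Hi [name], I'm evaluating [vendor] for a fourteen-person services firm and saw on your profile that you used them at [company]. One question, if you have two minutes: what surprised you that they didn't warn you about? Happy to return the favour if you're ever evaluating something we use.” Short, specific, transparent about why you are asking, and answerable in two minutes.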

Professional community channels carry candour that no reference call produces. Industry Slack groups, sector subreddits like r/sysadmin or r/ITManagers, trade forums and Discord servers all contain practitioner-to-practitioner discussion of specific vendors with no concern about being relayed back. Filter what you find by company size and industry similar to yours.

Software review aggregators like G2, TrustRadius and Capterra carry hundreds of reviews where the vendor’s list carries three. The rating distribution itself is informative. Polarised ratings suggest a vendor that works well in some contexts and fails in others. Uniformly five-star ratings suggest a curated review population rather than a genuine signal. Read the negative reviews carefully, filter by company size and industry, and look for repeated patterns rather than isolated complaints.

Cold outreach to customers named in the vendor’s published case studies is the fourth channel and the most under-used. Case studies are marketing, so the vendor’s framing is positive by design. The named customer is real, contactable on LinkedIn, and often willing to give twenty minutes to a peer making the same decision. The divergence between the case study’s framing and the customer’s actual account is often where the most useful information sits.

What do you do with the picture you assemble?

You convert it into patterns rather than verdicts. A single source mentioning a difficulty is anecdote. Five sources across independent channels mentioning the same surprise is a pattern that needs investigation before you sign. Map the answers across five buckets: timeline versus proposal, actual versus proposed cost, staff retraining burden, vendor responsiveness to problems, renewal intent. Ask where the sources agree, where they diverge, and what appears in independent sources but never in the vendor's pitch.
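A hedged illustration of what that map can look like, with invented sources and entries:

- Timeline versus proposal: two supplied references say on time, both LinkedIn contacts and one case-study customer say six to eight weeks late. Pattern, investigate before signing.
- Actual versus proposed cost: four of six sources volunteer unbudgeted data preparation. Pattern, negotiate a fixed-fee data phase.
- Vendor responsiveness to problems: all six describe fast, direct escalation. Consistent positive, weight it.
- Renewal intent: three confident renewals, two hedged, one unknown. Mixed, ask the hedged two why.

The value sits in where the entries agree or diverge across sources, not in any single cell.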

Generalities themselves are data. If reference answers are uniformly bland, that thinness is its own signal: a customer who cannot offer a specific story when asked for one is usually describing a relationship that has produced none worth telling. Absence is also data. If you ask six contacts about data quality and none of them can speak to it, ask their IT teams instead, because data quality issues are often handled invisibly by technical staff and never surface to business users until they become operational problems.

The output of all of this is not “this vendor is good” or “this vendor is bad”. The output is a clearer picture of what working with this vendor will actually look like in a firm the size and shape of yours, where the gaps in their pitch sit, and which contract clauses, scoping conversations or implementation commitments deserve harder negotiation before you sign. Three reference calls done this way are worth more than ten done by default. If three vendor proposals are open on your desk this week and the default reference process is the only diligence you have planned, book a conversation.

Sources

- Gartner (2024). Voice of the Customer methodology, the research framework establishing that generic satisfaction metrics correlate poorly with actual retention, renewal and deployment success and that diagnostic questions outperform satisfaction questions in B2B vendor evaluation. https://www.gartner.com/en/research/methodologies/voice-of-the-customer
- Forrester (2024). B2B vendor evaluation best practices, the report emphasising the structural distinction between satisfaction metrics and diagnostic information about actual customer experience in technology procurement. https://www.forrester.com/report/b2b-vendor-evaluation-best-practices/
- Harvard Business Review (2018). The Art of the Reference Check, the practitioner reference on social desirability and the structural friction of asking a stranger to criticise a supplier they continue to use. https://hbr.org/2018/04/the-art-of-the-reference-check
- McKinsey (2024). The procurement process, the operations research on systematic bias in supplier-selected references and the role of structured questioning in technology buying. https://www.mckinsey.com/capabilities/operations/our-insights/the-procurement-process
- MIT Sloan Management Review (2024). How to evaluate AI vendors, the piece setting out the specific failure modes of AI vendor evaluation including timeline slippage, hidden data preparation cost and integration complexity. https://sloanreview.mit.edu/article/how-to-evaluate-ai-vendors/
- BCG (2024). AI procurement survey, the primary research quantifying the gap between vendor implementation estimates and customer-reported actual timelines and costs in enterprise and mid-market AI deployments. https://www.bcg.com/publications/2024/ai-procurement-survey
- Deloitte (2024). State of Generative AI in the Enterprise, the survey reporting on common patterns of underestimated data preparation, staff retraining and integration cost across AI implementations. https://www.deloitte.com/global/en/services/consulting/research/state-of-generative-ai-enterprise.html
- Vendr (2024). Total cost of ownership for SaaS, the practitioner reference quantifying typical gaps between proposed software cost and all-in customer cost once implementation, integration and seat growth are included. https://www.vendr.com/blog/total-cost-of-ownership-saas
- G2 (2024). Software review platform methodology, the reference on filtering aggregated software reviews by company size, industry and implementation timeline to read negative reviews diagnostically rather than dismissively. https://www.g2.com/categories/artificial-intelligence
- Harvard Business Review (2019). How to Conduct Better Reference Checks, the practitioner reference on reframing reference conversations as peer learning rather than evaluation, and the increase in candour that follows from confidentiality commitments. https://hbr.org/2019/03/how-to-conduct-better-reference-checks

Frequently asked questions

How many reference calls is enough for an AI vendor decision at owner scale?

Three well-run calls beat ten run by default. The benefit comes from patterns across independent sources, not from a high call count with one biased source. Three vendor-supplied references plus two contacts found through LinkedIn or community discussion produces five viewpoints from different selection populations. If the patterns line up across those five, you have credible information. If they diverge sharply, you have identified the area that needs more digging before signing.

What if the vendor refuses to give references, or only gives one?

Treat thin reference provision as data. A vendor with strong customer relationships in your size band and industry can usually surface three contactable references inside a week. A vendor who cannot is either early-stage in your segment, dependent on a few large customers who do not look like you, or guarding access because the typical experience does not survive contact. Ask the vendor directly why the list is short. Then go to LinkedIn, community channels, and the customers named in their published case studies to triangulate.

Is it acceptable to record the reference call?

Only with explicit consent, and the request itself usually costs you candour. The reference is more open when the conversation feels confidential. Take notes during and write them up immediately after. Confirm at the start of the call that you will not attribute specific statements to the person you are speaking to, and that you are looking for patterns rather than quotes. That confidentiality is what makes the peer-to-peer reframe work; recording compromises it.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
