How to Run a Conversational AI Demo (Without Getting Sold)
The exact playbook for evaluating conversational AI vendors: how to bring your own data, what to test, and the questions that separate honest demos from polished pitches.
Vendors will run their best demos for you. They'll pick the ticket categories they handle well. They'll pre-configure responses. They'll show you the happy path, every time.
Your job in a demo isn't to be sold to. It's to figure out what the product does, in production, on your real tickets. This is the playbook.
Before the demo: what to bring
Three things to prepare:
1. 30–50 anonymized real tickets
Pull from the last 30 days. Mix categories: some WISMO (where is my order), some refunds, some complaints, some weird ones. Strip PII (names, emails, addresses, payment info). Save as a simple list with three fields: customer message, ticket category, what your team did.
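If your helpdesk exports to CSV, a short script handles the scrub and reshaping. A minimal sketch, assuming illustrative column names (message, category, resolution); the regexes catch obvious PII only, so hand-check the output before sending it anywhere:

```python
import csv
import re

# Illustrative patterns only -- names, addresses, and payment details
# usually need manual review on top of this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Mask the PII a regex can reliably catch."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

with open("tickets_export.csv") as src, open("demo_tickets.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=["customer_message", "category", "what_we_did"])
    writer.writeheader()
    for row in csv.DictReader(src):
        writer.writerow({
            "customer_message": scrub(row["message"]),
            "category": row["category"],
            "what_we_did": scrub(row["resolution"]),
        })
```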
2. A list of your top 5 ticket categories with volume estimates
Vendors love to talk about average resolution rates. You want category-specific numbers. Send this in advance: "Our inbound is 45% WISMO, 12% returns, 8% refunds, 6% subscription edits, 5% account issues. What's your AI's resolution rate on each?"
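If you built the CSV above, the category breakdown falls out in a few lines; just run it over a full 30-day export rather than the 30–50-ticket sample, or the percentages won't be trustworthy. A sketch:

```python
import csv
from collections import Counter

with open("demo_tickets.csv") as f:
    counts = Counter(row["category"] for row in csv.DictReader(f))

total = sum(counts.values())
for category, n in counts.most_common(5):
    print(f"{category}: {n / total:.0%} of inbound ({n} tickets)")
```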
3. A sandbox of your store
Shopify, WooCommerce, Stripe, whatever you use. Set up a test account or sandbox environment. You'll ask the vendor to connect to it during the demo.
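One way to make sure the AI has something to find: seed the sandbox with a test order before the call. A minimal sketch against Shopify's Admin REST API, assuming a development store; the store name, access token, and API version are placeholders:

```python
import requests

SHOP = "your-dev-store"       # development store, never production
TOKEN = "shpat_placeholder"   # Admin API access token with order write access
URL = f"https://{SHOP}.myshopify.com/admin/api/2024-01/orders.json"

# An order the AI should be able to look up during the demo.
payload = {
    "order": {
        "email": "test-customer@example.com",
        "financial_status": "paid",
        "line_items": [{"title": "Demo Widget", "price": "19.99", "quantity": 1}],
    }
}

resp = requests.post(URL, json=payload, headers={"X-Shopify-Access-Token": TOKEN})
resp.raise_for_status()
print("Created sandbox order", resp.json()["order"]["id"])
```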
Six tests to run in the demo
The structure: 30–45 minutes total, 5–7 minutes per test. Don't let the vendor spend the first 20 minutes on company history.
Test 1: phrasing variation (5 min)
Take one ticket category and ask the same question three different ways. "Where's my order?" / "Did my package ship yet?" / "Tracking update on my recent order?" (A scripted version of this check follows the pass/fail criteria.)
- Pass: the AI handles all three identically and pulls the right data.
- Fail: any of the three trips up the AI, triggering clarification it shouldn't need or returning a different answer.
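Scripting the check makes it repeatable across vendors. A sketch with a stubbed send_message standing in for however you reach the vendor's sandbox (API, widget, email); the response fields are illustrative:

```python
PHRASINGS = [
    "Where's my order?",
    "Did my package ship yet?",
    "Tracking update on my recent order?",
]

def send_message(text: str) -> dict:
    # Stub: replace with the real call into the vendor's sandbox.
    return {"intent": "order_status", "tracking_number": "1Z999AA10123456784"}

replies = [send_message(p) for p in PHRASINGS]

# All three phrasings should resolve to one intent and one order lookup.
intents = {r["intent"] for r in replies}
tracking = {r.get("tracking_number") for r in replies}
assert len(intents) == 1, f"Inconsistent intent detection: {intents}"
assert len(tracking) == 1, f"Inconsistent order lookup: {tracking}"
```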
Test 2: real action execution (5 min)
Ask the AI to do something: process a refund, update an address, pause a subscription. Watch what happens, then verify it outside the chat (a sketch follows the pass/fail criteria).
- Pass: the action executes (you see the refund in Stripe, the address change in Shopify, etc.). The customer-facing reply confirms it specifically ("Refunded $X to your Visa ending 4242").
- Fail: the AI says it "would" take the action but doesn't, or the action requires a human to approve, or the response is generic ("I've started the refund process") without confirmation that anything happened.
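Don't take the chat transcript's word for it: check the processor directly. A minimal sketch with Stripe's official Python library in test mode; confirm the refund the AI claims to have issued actually shows up:

```python
import stripe

stripe.api_key = "sk_test_placeholder"  # your sandbox's test-mode key

# List the most recent refunds and eyeball amount, currency, and status.
for refund in stripe.Refund.list(limit=5).data:
    print(refund.id, refund.amount / 100, refund.currency, refund.status)
```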
Test 3: integration setup live (5 min)
Hand the vendor your sandbox credentials. Ask them to connect during the call.
- Pass: connection happens in the demo. The AI starts pulling data from your sandbox within minutes. You see your sandbox orders appearing in the AI's responses.
- Fail: vendor needs to "schedule a follow-up for integration" or says "our team will handle that during onboarding." Translation: their integration story is harder than they're claiming.
Test 4: escalation behavior (5 min)
Send a message that should escalate. Try anger ("This is the THIRD time I've contacted you and nothing has happened, this is unacceptable"). Try legal language ("I'm considering a chargeback dispute"). Try ambiguous data ("I returned an item but never got a refund").
- Pass: the AI escalates without trying to handle it. The escalation includes context for the human: the conversation, the customer data, and a flag for why it escalated (a field checklist follows the criteria).
- Fail: the AI tries to deflect with sympathy phrases, or asks the customer for more info instead of escalating, or escalates without context (just "Customer needs help").
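The pass criteria double as a checklist you can run against whatever handoff payload the vendor emits. Field names below are illustrative; map them to the vendor's actual escalation format:

```python
# Context a human agent needs to pick up an escalation cold.
REQUIRED_CONTEXT = {
    "conversation_transcript",
    "customer_id",
    "order_history",
    "escalation_reason",
}

def missing_context(payload: dict) -> list:
    """Return the required fields absent from an escalation payload."""
    return sorted(REQUIRED_CONTEXT - payload.keys())

# A bare "Customer needs help" escalation fails on three of four fields.
bare = {"escalation_reason": "Customer needs help"}
print(missing_context(bare))
# -> ['conversation_transcript', 'customer_id', 'order_history']
```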
Test 5: multilingual (3–5 min)
Send a message in a non-English language: Spanish, French, German, whichever your customers use. (A mechanical check follows the criteria.)
- Pass: the AI detects the language and replies in it natively, pulling data from your (English) knowledge base and translating on the fly.
- Fail: AI replies in English. Or asks the customer to switch languages. Or escalates because it "doesn't support that language."
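The language check is mechanical. A sketch using the langdetect library (pip install langdetect); the vendor_reply stub is hypothetical:

```python
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make detection deterministic on short texts

def vendor_reply(text: str) -> str:
    # Stub: replace with the real call into the vendor's sandbox.
    return "Su pedido se envió ayer y llegará el jueves."

customer_message = "¿Dónde está mi pedido? Lo compré hace dos semanas."
reply = vendor_reply(customer_message)

# Pass only if the reply comes back in the customer's language.
assert detect(reply) == detect(customer_message) == "es", f"Not Spanish: {reply!r}"
```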
Test 6: failure transcript (5 min)
Ask the vendor: "Show me the last 100 tickets your AI didn't handle, across all customers, with the reasons each was escalated."
- Pass: vendor shows you a real list. Categories of escalation. Common patterns. They can talk through what their AI is bad at without flinching.
- Fail: vendor can't or won't show this. Or shows you only the easy escalations (sentiment-based, legal language) and avoids the cases where the AI just got confused.
If the vendor genuinely doesn't have a failure transcript ready (some don't audit their own escalations, which is itself a yellow flag), have a fallback question ready: "Tell me about the last bug or quality issue a customer complained to you about." Listen for specificity. A confident "we've never had a complaint" is the same red flag as refusing to show the transcript: every production AI has generated a complaint, and confident vendors have stories about how they fixed it. Vague answers or deflection are the signal.
This last test is the most important. Vendors who hide failures have something to hide.
The questions that separate good vendors from bad
Beyond the live tests, six questions that quickly reveal vendor quality:
"What's your confidence threshold below which the AI escalates?"
- Good answer: A specific number, say 0.65, with an explanation of how it's tuned per category and whether the customer can configure it (a sketch of what that looks like follows).
- Bad answer: "Our AI is really accurate so we don't need a threshold." (Translation: no quality gate.)
"Show me a ticket where your AI was wrong and a customer complained."
- Good answer: They have one ready. They walk through what happened, what they fixed.
- Bad answer: "We've never had a complaint." (Lying.) Or "Let me get back to you on that." (Stalling.)
"What's your hallucination rate on a third-party benchmark?"
- Good answer: A specific number with the test set referenced. Explanation of what they do to keep it low.
- Bad answer: "We use [LLM name], it doesn't hallucinate." (No vendor's tool is hallucination-free; if they claim that, they're lying.)
"What integrations are pre-built vs custom?"
- Good answer: A specific list. "Shopify, Stripe, Klaviyo, WooCommerce, Recharge are native. Beyond that, custom integrations cost $X and take Y weeks."
- Bad answer: "We can integrate with anything." (Translation: nothing is pre-built and everything will be a project.)
"What does setup look like for a brand at our volume?"
- Good answer: Specific timeline. "Day 1: connect Shopify and Stripe, point at your help center. Day 2–3: review draft replies in shadow mode. Week 2: turn on autonomous for WISMO."
- Bad answer: "Setup varies. Our customer success team will scope it during onboarding." (Translation: setup will take longer than they want to admit.)
"How do I cancel?"
- Good answer: Month-to-month, 30-day notice, full data export.
- Bad answer: Annual contract with no early termination, complex cancellation procedures, "we'll work with you to find what's not working." (Translation: locked in.)
Red flags during the demo
Things that should make you walk away:
- The vendor never says "we can't do that." Every real product has limits. Vendors who say yes to everything are setting you up for disappointment in production.
- Every customer message in the demo is short and clean. Real customers send paragraph-long emotional messages with typos. Demos that only show clean messages are hiding edge cases.
- The integration "happens off-screen." They navigate to a separate tab and tell you "it's connected now." Watch the integration happen. If they hide it, there's a reason.
- The AI uses generic responses. "I'd be happy to help!" / "Thanks for reaching out!" in every reply suggests templates, not real LLM generation.
- No CSAT or quality data shown. A confident vendor shows you real CSAT numbers from production customers. If they won't show the numbers, the numbers aren't impressive.
What to ask after the demo
If you're seriously evaluating, follow up with:
Send 50 anonymized tickets, ask for predicted outcomes
After the demo, email the vendor 50 of your real tickets. Ask: "For each of these, predict what your AI would do: resolve, escalate, or get it wrong. Send me the answers."
The honest vendors give specific predictions. The dishonest ones say "we'd resolve all of them." When you compare predictions to real outcomes after the pilot, you'll know who lied.
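Scoring this after the pilot is mechanical. A minimal sketch, assuming you keep predictions and outcomes in two CSVs with illustrative column names (resolve / escalate / wrong in each):

```python
import csv

def load(path: str, col: str) -> dict:
    with open(path) as f:
        return {row["ticket_id"]: row[col] for row in csv.DictReader(f)}

predicted = load("vendor_predictions.csv", "predicted")
actual = load("pilot_outcomes.csv", "actual")

# Score only the tickets that appear in both files.
common = predicted.keys() & actual.keys()
matches = sum(predicted[t] == actual[t] for t in common)
print(f"Vendor prediction accuracy: {matches}/{len(common)} ({matches / len(common):.0%})")
```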
Ask for 3 reference customers in your space
Specifically: brands at similar volume, similar category, that have been live for 6+ months. Ask the vendor to introduce you. Then ask the references the questions vendors won't answer:
- "What's your actual autonomous resolution rate today?"
- "Where does the AI struggle?"
- "How often does the vendor's support team respond when you have an issue?"
- "If you were starting over, what would you do differently?"
Get the contract terms in writing before signing
Specifically:
- Pricing model (per-resolution / per-ticket / per-seat) and the unit rate
- Minimum monthly commitment
- Overage rate and cap
- Cancellation terms
- Implementation timeline with specific milestones
- What happens to your data if you cancel
A demo evaluation scorecard
Run this scorecard for every vendor demo:
| Test | Pass / Fail | Notes |
|---|---|---|
| Phrasing variation | __ | |
| Real action execution | __ | |
| Integration setup live | __ | |
| Escalation behavior | __ | |
| Multilingual | __ | |
| Failure transcript | __ | |
| Confidence threshold answer | __ | |
| Wrong-AI ticket example | __ | |
| Integrations list specific | __ | |
| Setup timeline specific | __ |
10 passes: strong vendor, take to the next stage. 6–9 passes: mixed; probe the failures. Below 6: either a chatbot dressed up as AI or a product not ready for production. Drop.
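Comparing several vendors? The tally and decision rule are trivially scriptable; the results below are made up for illustration:

```python
# One completed scorecard per vendor; True = pass.
scorecard = {
    "Phrasing variation": True,
    "Real action execution": True,
    "Integration setup live": False,
    "Escalation behavior": True,
    "Multilingual": True,
    "Failure transcript": False,
    "Confidence threshold answer": True,
    "Wrong-AI ticket example": True,
    "Integrations list specific": True,
    "Setup timeline specific": True,
}

passes = sum(scorecard.values())
if passes == 10:
    verdict = "Strong vendor. Take to next stage."
elif passes >= 6:
    failed = ", ".join(test for test, ok in scorecard.items() if not ok)
    verdict = f"Mixed. Probe the failures: {failed}."
else:
    verdict = "Drop. Chatbot dressed up as AI, or not production-ready."
print(f"{passes}/10 -- {verdict}")
```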
What we'd ask of you
If you're evaluating us at Ensoras, hold us to this same standard. Bring your tickets. Run the six tests. Demand the failure transcript. We'd rather lose a deal than win one we can't deliver on.
Book a demo and we'll connect to your sandbox during the call. If we fail any of the six tests above, tell us; we want to know.
Sources
- Forrester, "Predictions 2026: AI Gets Real For Customer Service, But It's Not Glamorous Work": analyst context on what production AI deployments look like behind the demo polish.
- Anthropic, "Building effective agents": model-provider research on the architectural patterns that determine whether a conversational AI works in production (which is what these tests are actually probing).
- CBC News, "Air Canada found liable for chatbot's bad advice" (2024): what happens in production when an AI ships without confidence thresholds or human escalation. Worth keeping in mind when a vendor claims their AI never gets things wrong.
Frequently asked questions
How long should a good AI demo take?
30–45 minutes if you bring tickets and run structured tests. Vendors will try to extend to 60–90 minutes with company history and product overview; politely decline. Your time is better spent watching the AI work on real tickets than watching slides.
Should I let the vendor pick the demo tickets?
No. Vendors will pick tickets their AI handles well. Bring 30–50 anonymized real tickets from your own queue. You'll learn 10x more about the product.
What if the vendor says they need 'preparation time' to handle my tickets?
That's a red flag. Modern AI doesn't need ticket-specific preparation; with your knowledge base connected, it should handle your tickets out of the box. 'Preparation time' usually means they're going to manually configure responses for the demo, which won't reflect production behavior.
How do I tell if a demo is being staged?
Three signs: the AI never asks for clarification, every action succeeds on the first try, and the responses sound suspiciously polished. Real production AI sometimes asks for missing info, sometimes hits an integration timeout, sometimes generates a slightly off-tone reply. A perfect demo is usually a staged demo.
What should I never agree to in a first demo?
Don't sign anything. Don't accept exclusive evaluation periods (where you can't talk to other vendors). Don't accept implementation timelines that depend on undefined deliverables ('we'll figure out the integration plan later'). All of these are tactics to lock you in before you've evaluated alternatives.