We didn't plan to switch.
When we first built the AI voice agent for our client — a mid-sized Italian company handling roughly 3,000 inbound calls per month — Gemini was the obvious pick. Good pricing, fast API, solid Italian language support. We scaffolded the whole pipeline in about six weeks: Deepgram for ASR, ElevenLabs for voice synthesis, Vapi as the orchestration layer, and Gemini 1.5 Flash as the brain.
It worked. Kind of.
Three months into production, we started seeing a pattern in the support tickets: about 8% of callers were hanging up mid-conversation. In a call center context, that's a lot of lost revenue. Our client started asking questions we couldn't answer with clean data.
This post is the honest account of what we found, what we did about it, and what the migration from Gemini to the Claude API actually cost us in money, engineering hours, and downtime risk.
Before we get into the switch, it's worth explaining why we picked Gemini in the first place. This isn't a "Gemini is bad" story — it's a context story.
The client's primary use case: inbound qualification calls for B2B leads. The AI needed to understand intent, ask follow-up questions, collect structured data, and handle frustrated or off-script callers without breaking the conversation.
Gemini 1.5 Flash was fast. LLM latency hovered around 400–600ms, which, combined with Deepgram's 200–300ms transcription, kept end-to-end voice latency under one second. Anything over 1.2–1.5 seconds and the conversation starts feeling broken.
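The budget arithmetic above can be sketched in a few lines. This is only an illustration: the stage names are made up for the example, and TTS time-to-first-audio is not quoted in this post, so it is left out and has to fit in whatever headroom remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Latency range for one sequential pipeline stage, in milliseconds."""
    name: str
    low_ms: int
    high_ms: int


def total_range(stages):
    """Sum best-case and worst-case latency across sequential stages."""
    return sum(s.low_ms for s in stages), sum(s.high_ms for s in stages)


# Ranges quoted in the paragraph above. TTS is intentionally absent
# because the post does not give a figure for it.
STAGES = [
    StageLatency("asr_deepgram", 200, 300),
    StageLatency("llm_gemini_flash", 400, 600),
]
TARGET_MS = 1000  # "under one second"

low, high = total_range(STAGES)
headroom_worst_case = TARGET_MS - high
print(f"ASR + LLM: {low}-{high} ms, worst-case headroom: {headroom_worst_case} ms")
```

Under these numbers the worst case leaves about 100 ms for everything else in the pipeline, which is why any extra LLM latency shows up so quickly in a voice product.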
The first signal wasn't dramatic. Our client sent a Slack message in month two: "Some people are complaining the bot doesn't understand them when they go off topic."
The transcripts were clean — Deepgram was doing its job. The issue was downstream: Gemini was producing technically correct responses but contextually flat ones. When a caller deviated from the expected flow — an angry remark, a sarcastic comment, an Italian colloquial expression — the model either ignored the subtext or produced a response that felt robotic in the worst possible way.
Caller: "Look, I've been transferred three times already, I'm exhausted."
Gemini response: "I understand. Can you tell me the name of your company?"
Not wrong. Just... wrong.
Over the following weeks we documented three distinct failure categories:
The model treated frustrated callers the same as neutral ones. No acknowledgment, no tone shift, no recovery language. Callers who were slightly annoyed became significantly more annoyed after interactions like the one above.
Gemini's Italian was grammatically correct but texturally foreign. Its phrasing followed syntactic patterns that native Italian speakers associate with translation, not natural speech. The effect was subtle but cumulative: by minute 3, callers could sense something was off.
When a caller's intent didn't map cleanly to one of our defined branches, Gemini would either loop the previous question or make a low-confidence assumption and proceed. Edge cases — roughly 15% of calls — were being handled poorly.
| Metric | Result | Target | Delta |
|---|---|---|---|
| Call completion rate | 88.2% | 92% | −3.8pp |
| Successful qualification rate | 71.0% | 80% | −9pp |
| Average call duration | 4 min 38 sec | ≤4 min | +38 sec |
| Caller satisfaction (SMS survey) | 3.4 / 5 | 4.0+ | −0.6 |
We ran a three-week internal evaluation before recommending the migration. Each model was tested on 200 synthetic call transcripts: 60% standard flows, 25% off-script, 15% adversarial (frustrated, confused, or deliberately evasive callers).
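For readers who want to reproduce a similar test set, here is a minimal sketch of the 60/25/15 split. The pool structure, category keys, and function names are assumptions for illustration, not our actual tooling.

```python
import random

# Mix described above, as whole percentages to avoid float rounding.
MIX_PERCENT = {"standard": 60, "off_script": 25, "adversarial": 15}


def category_counts(total, mix_percent):
    """Turn a percentage mix into whole-number counts that sum to total."""
    counts = {name: total * pct // 100 for name, pct in mix_percent.items()}
    # Give any rounding remainder to the largest category.
    remainder = total - sum(counts.values())
    largest = max(mix_percent, key=mix_percent.get)
    counts[largest] += remainder
    return counts


def build_eval_set(pools, total, mix_percent, seed=0):
    """Draw a stratified, reproducible sample from per-category transcript pools."""
    rng = random.Random(seed)
    selected = []
    for name, n in category_counts(total, mix_percent).items():
        selected.extend((name, t) for t in rng.sample(pools[name], n))
    rng.shuffle(selected)
    return selected


print(category_counts(200, MIX_PERCENT))
```

With 200 transcripts this works out to 120 standard, 50 off-script, and 30 adversarial cases. Fixing the seed means every model is scored on exactly the same set.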
| Model | Latency (p50) | Italian quality | Tone handling | Est. cost/call |
|---|---|---|---|---|
| Gemini 1.5 Flash | ~480ms | Good | Weak | ~$0.022 |
| GPT-4o mini | ~520ms | Good | Medium | ~$0.028 |
| Claude 3.5 Haiku | ~390ms | Excellent | Strong | ~$0.031 |
| Claude Sonnet 4 | ~680ms | Excellent | Excellent | ~$0.058 |
Claude 3.5 Haiku won on the combination of latency, Italian naturalness, and what we internally called "graceful degradation" — what the model does when it doesn't have a clean answer. Instead of proceeding blindly or looping, it would surface the ambiguity naturally:
"Scusa, vuoi dire che il problema è principalmente con i tempi di consegna, o riguarda qualcos'altro?" (Sorry, do you mean the problem is mainly with delivery times, or is there something else?)
That one behavior — asking a clarifying question like a human would — was worth more than any benchmark score. We went with Haiku for the primary path and kept Sonnet 4 as a fallback for unresolved escalation flows.
This is the part most blog posts skip.
Your Gemini prompts don't port cleanly to Claude. Gemini with a medium-length system prompt tends to be literal and task-focused. Claude with the same prompt will try to infer intent more aggressively — usually good, but occasionally it over-elaborates in a voice context, adding 40 words instead of 20 and destroying your TTS latency budget.
We rewrote the system prompt from scratch. Key changes:
Prompt rewrite: ~28 engineering hours.
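One cheap way to catch the over-elaboration described above while tuning a prompt is to flag replies that exceed a word budget. This is a rough sketch, not the check we shipped: the 20-word budget is borrowed from the "40 words instead of 20" example, and whitespace splitting is only an approximation of spoken length.

```python
MAX_WORDS = 20  # hypothetical budget, taken from the example above


def word_count(reply):
    """Approximate spoken length by counting whitespace-separated words."""
    return len(reply.split())


def over_budget(replies, max_words=MAX_WORDS):
    """Return the replies that are too long for a single voice turn."""
    return [r for r in replies if word_count(r) > max_words]


candidate_replies = [
    "Scusa, vuoi dire che il problema è principalmente con i tempi di "
    "consegna, o riguarda qualcos'altro?",
    " ".join(["parola"] * 40),  # synthetic 40-word reply for the demo
]
flagged = over_budget(candidate_replies)
print(f"{len(flagged)} of {len(candidate_replies)} replies exceed {MAX_WORDS} words")
```

Running a check like this over a batch of test transcripts after each prompt change gives a quick signal before anything reaches TTS.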
Swapping the LLM in Vapi is, in theory, a config change. In practice: re-validate every webhook, re-test latency under load, re-tune interruption detection thresholds (Claude's token rhythm differs from Gemini's, which affects how Vapi decides the model is "done speaking").
Integration work: ~14 engineering hours.
Before cutting traffic over, we ran Claude in shadow mode — receiving the same inputs as Gemini but not producing TTS output. We compared responses manually on 300 real calls over seven days. We caught three prompt edge cases we hadn't anticipated, fixed two, and documented one as acceptable behavior.
Shadow testing: ~8 engineering hours + 12 hours QA.
We ramped traffic 10% → 30% → 70% → 100% over four days. We monitored call completion rate, average duration, and DTMF escape rate in real time. No rollback was needed. By day 3 at 70% traffic, the metrics were already pointing in the right direction.
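A percentage ramp like this is commonly implemented with deterministic bucketing, so each call is assigned once and calls already on the new model stay there as the percentage grows. The sketch below assumes a string call ID is available; the function names are illustrative, and this is not how Vapi itself routes traffic.

```python
import hashlib

ROLLOUT_STAGES = [10, 30, 70, 100]  # percent of calls on the new model


def bucket(call_id: str) -> int:
    """Map a call ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_model(call_id: str, rollout_percent: int) -> bool:
    """True if this call should be served by the new model at this stage."""
    return bucket(call_id) < rollout_percent


sample_ids = [f"call-{i}" for i in range(1000)]
for percent in ROLLOUT_STAGES:
    share = sum(use_new_model(c, percent) for c in sample_ids) / len(sample_ids)
    print(f"target {percent}% -> observed {share:.1%}")
```

Because the bucket depends only on the call ID, rolling back is a single change to the percentage, and a call never switches models mid-conversation.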
Total: approximately 62 hours across two senior engineers and one QA specialist. Client billing: €5,800 fixed price, agreed upfront.
| Line item | Gemini 1.5 Flash | Claude 3.5 Haiku |
|---|---|---|
| Input (per 1K tokens) | $0.075 | $0.080 |
| Output (per 1K tokens) | $0.300 | $0.400 |
| Avg tokens per call | ~2,500 | ~2,200 * |
| Cost per call (LLM only) | ~$0.022 | ~$0.027 |
| Monthly (3,000 calls) | ~$66 | ~$81 |
| Monthly delta | — | +$15/month |
* Claude's responses are more concise once the prompt is correctly tuned, which partially offsets the higher per-token cost.
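The monthly figures in the table follow directly from the per-call cost and the call volume. A small sketch, using only the per-call numbers from the table (the dictionary keys are labels chosen for this example):

```python
CALLS_PER_MONTH = 3000

# LLM-only cost per call, in USD, from the table above.
COST_PER_CALL_USD = {
    "gemini_1_5_flash": 0.022,
    "claude_3_5_haiku": 0.027,
}


def monthly_cost(cost_per_call, calls):
    """Monthly LLM spend in USD, rounded to cents."""
    return round(cost_per_call * calls, 2)


monthly = {
    model: monthly_cost(cost, CALLS_PER_MONTH)
    for model, cost in COST_PER_CALL_USD.items()
}
delta = round(monthly["claude_3_5_haiku"] - monthly["gemini_1_5_flash"], 2)
print(monthly, f"delta: +${delta:.0f}/month")
```

Swapping in your own call volume shows how quickly, or slowly, the per-call difference adds up.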
Downtime: zero. Shadow mode plus gradual rollout meant the client never experienced a degraded state. This was the part the CTO cared about most.
| Metric | Gemini baseline | Claude (60-day avg) | Change |
|---|---|---|---|
| Call completion rate | 88.2% | 93.7% | +5.5pp |
| Successful qualification rate | 71.0% | 81.4% | +10.4pp |
| Average call duration | 4:38 | 4:02 | −36 sec |
| Caller satisfaction (SMS survey) | 3.4 / 5 | 4.1 / 5 | +0.7 |
| Human escalation rate | 14.3% | 9.1% | −5.2pp |
| DTMF escape ("press 0") rate | 6.8% | 3.2% | −3.6pp |
In text, a slightly robotic response is tolerable. In voice — where the human brain detects inauthenticity in milliseconds — it's fatal to the conversation. This is the most underrated evaluation dimension when picking a voice AI model.
A 1,200-token system prompt with Claude produces a measurably different latency profile than a 400-token one. For voice, keep your prompt lean and your few-shot examples tight. Latency budgets don't forgive bloated context.
We could have cut over in week two. We didn't. Shadow testing found three bugs that would have been ugly in production. If you're migrating a live voice system, budget for shadow mode.
Every major LLM will tell you they support Italian. The gap between "supports" and "sounds natural" is large and only visible in production. If you're building for a non-English market, evaluate specifically on your target language with real native-speaker listeners — not automated metrics.
Our $15/month LLM increase was irrelevant to the business case. The 62 hours of engineering and QA time and the 60-day measurement period were the real costs. Plan for those.
The migration wasn't technically complex. What it required was honesty about what the original system wasn't doing well, a rigorous evaluation process, and a careful rollout. Those three things are always the hard part — not the API call.