You’re paying Claude Opus prices to label support tickets. Or you’re paying nothing and shipping Llama-quality customer emails. There’s a third path most teams either ignore or over-engineer into a six-month platform project. Multi-LLM routing — sending each task to the model that actually fits — is the obvious move, but nobody shows you the real numbers. So I ran 10 production tasks across Claude Opus 4.6, GPT-5.2, Gemini 3.1 Pro, and Llama 3.1 70B, then built the decision matrix.
What I Actually Tested (No Benchmarks, Real Tasks)
I’m not running MMLU on these. I ran the things I actually do in production: a customer refund email, marketing copy for a SaaS landing page, a SQL query against a 14-table schema, a Python refactor of a 200-line ETL script, data extraction from a messy 30-page PDF, summary of a 90-minute meeting transcript, classification of support tickets into 8 categories, structured JSON for a downstream pipeline, ad headline variations for an A/B test, and a multi-step reasoning prompt about pricing tradeoffs.
Same prompt across all four models. Temperature 0.3. Three runs each, averaged. I scored on three things only: would I ship this output, p50 latency in seconds, and cents per task at API list price.
The surprises showed up in the tasks I expected to tie.
Where Each Model Won (and By How Much)
Claude Opus took the writing tier — and the lead wasn’t subtle. Refund email, marketing copy, transcript summary all went to Opus. GPT-5.2’s marketing copy was technically correct and emotionally flat. Llama’s email read like a templated SaaS bot. Cost: ~6 cents per task on Opus, the highest in the test.
GPT-5.2 won reasoning. It was the only model that flagged a subtle race condition in the Python refactor without being told one existed. It also won the multi-step pricing prompt — the others tracked the math but missed the second-order effect on customer LTV.
Gemini 3.1 Pro took the messy PDF extraction (vision plus the long context did the work) and was the price-quality winner on SQL — same query quality as Claude at roughly a quarter of the cost.
Llama 3.1 70B via Groq won classification and structured JSON outright. Quality was indistinguishable from the premium models. Cost was ~20x lower. Latency was ~5x faster — sub-second on tasks that took Opus 4-6 seconds.
| Task type | Winner | Quality | Cost vs Opus | Latency vs Opus |
|---|---|---|---|---|
| Customer email | Claude Opus | Ship | 1x | 1x |
| Marketing copy | Claude Opus | Ship | 1x | 1x |
| Transcript summary | Claude Opus | Ship | 1x | 1x |
| Python refactor | GPT-5.2 | Ship | 0.6x | 1.2x |
| Multi-step reasoning | GPT-5.2 | Ship | 0.6x | 1.2x |
| PDF extraction | Gemini 3.1 Pro | Ship | 0.3x | 0.8x |
| SQL generation | Gemini 3.1 Pro | Ship | 0.25x | 0.7x |
| Ticket classification | Llama 3.1 | Tie | 0.05x | 0.2x |
| Structured JSON | Llama 3.1 | Tie | 0.05x | 0.2x |
| Ad headlines | Claude Opus | Ship | 1x | 1x |
The honest losers: nobody wants Llama writing the refund email; nobody should pay Opus prices to label tickets. The wins are real. The question is how to turn “Claude wins writing, Gemini wins extraction” into production code that picks the right one per call.
The Decision Tree (Copy This Into Your Router)
Three steps. Classify, route, escape hatch. That’s the whole pattern.
Step 1: Classify into four buckets. Writing, reasoning, extraction, structured/classification. A 50-token classifier call to Llama via Groq is cheap enough — roughly $0.0001 per request — to run on every call (a sketch of that classifier follows Step 3).
Step 2: Route by bucket. From the test, the default mapping is writing → Claude Opus, reasoning → GPT-5.2, extraction → Gemini 3.1 Pro, structured/classification → Llama 3.1.
Step 3: Add a quality escape hatch. If the cheap model’s self-reported confidence falls below threshold, retry on the premium tier. This catches the 5-10% of edge cases where the cheap model fumbles.
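Step 1 is the only new moving part, so here’s a minimal sketch of the classifier, assuming Groq’s OpenAI-compatible endpoint and the official openai Python client; the model id and system prompt are illustrative, not what I ran in the test:

```python
import os
from openai import OpenAI

# Groq exposes an OpenAI-compatible API; the base_url and model id below are
# placeholders for whatever Llama endpoint you actually run.
groq = OpenAI(base_url="https://api.groq.com/openai/v1",
              api_key=os.environ["GROQ_API_KEY"])

BUCKETS = {"writing", "reasoning", "extraction", "structured"}

def classify(task: str) -> str:
    """Map a task description to one of the four routing buckets (~50 tokens)."""
    resp = groq.chat.completions.create(
        model="llama-3.1-70b-versatile",  # illustrative Groq model id
        max_tokens=5,
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Classify the task into exactly one bucket: writing, reasoning, "
                "extraction, or structured. Reply with the bucket name only.")},
            {"role": "user", "content": task},
        ],
    )
    bucket = resp.choices[0].message.content.strip().lower()
    return bucket if bucket in BUCKETS else "reasoning"  # safe default on an odd reply
```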
Here’s the actual pattern:
```python
ROUTING = {
    # bucket: (primary model, premium fallback)
    "writing":    ("claude-opus-4-6", "gpt-5-2"),
    "reasoning":  ("gpt-5-2", "claude-opus-4-6"),
    "extraction": ("gemini-3-1-pro", "claude-opus-4-6"),
    "structured": ("llama-3-1-70b", "gpt-5-2"),
}

def route_and_call(task: str, prompt: str) -> str:
    bucket = classify(task)              # Step 1: 50-token Llama call
    primary, fallback = ROUTING[bucket]  # Step 2: pick primary + fallback
    response = call(primary, prompt)
    if response.confidence < 0.75:       # Step 3: escape hatch to the premium tier
        return call(fallback, prompt).text
    return response.text
```
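route_and_call assumes a call() helper that returns both the text and the model’s self-reported confidence. One way to get that, as a sketch: append a format instruction to every prompt and parse the model’s own estimate out of the reply. client_for() here is a hypothetical stand-in for whichever provider SDK the model name maps to:

```python
import json
from dataclasses import dataclass

@dataclass
class ModelResponse:
    text: str
    confidence: float  # the model's own 0-1 estimate, not a calibrated score

CONFIDENCE_SUFFIX = (
    "\n\nReturn JSON with two keys: 'answer' (your response) and "
    "'confidence' (a number from 0 to 1 for how confident you are)."
)

def call(model: str, prompt: str) -> ModelResponse:
    # client_for() is a hypothetical stand-in for whichever provider SDK
    # (Anthropic, OpenAI, Gemini, Groq) the model name maps to.
    raw = client_for(model).complete(prompt + CONFIDENCE_SUFFIX)
    try:
        parsed = json.loads(raw)
        return ModelResponse(str(parsed["answer"]), float(parsed["confidence"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # The model ignored the format: treat it as low confidence so the
        # premium fallback gets a look.
        return ModelResponse(raw, 0.0)
```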
The dict matters. Non-engineers on the team can edit the routing table without touching code — that’s how you keep this maintainable past the first month.
For your own decision matrix, the template is simple. Columns: your task types. Rows: quality required, latency budget, daily volume. The output cell is the model. Paste it into a doc, fill it in, and you have your routing table.
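As a fill-in starting point, here’s that template as a markdown table; the task columns and the values in them are placeholders, not numbers from my test:

| | Task type 1 (e.g., customer email) | Task type 2 (e.g., ticket labeling) | Task type 3 (e.g., PDF extraction) |
|---|---|---|---|
| Quality required | ship as-is | parity with premium is enough | ship as-is |
| Latency budget | < 5 s | < 1 s | < 10 s |
| Daily volume | 2K | 50K | 500 |
| Model | premium writer | cheap + fast | long-context |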
The math from my test set: routing cut total spend ~42% versus all-Claude across the 10 tasks, with no measurable quality drop on the outputs I shipped. If you’re already on Claude API with prompt caching, routing layers on top of that — they compound.
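That 42% falls straight out of the table’s cost column: the ten relative-cost cells sum to 4 × 1.0 + 2 × 0.6 + 0.3 + 0.25 + 2 × 0.05 = 5.85 versus 10.0 for all-Opus, a roughly 42% cut before the (tiny) classifier cost and the occasional fallback retry nibble at it.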
This works on paper. But routing adds a classifier call, a second API to monitor, and a fallback path that doubles your token cost when it fires. So when does the overhead eat the savings?
When NOT to Bother Routing (The Honest Threshold)
Under ~100K calls/month on a single task type, the math flips. Engineering and ops cost of maintaining a router beats the model savings. Stay on one model. The 42% I saved compounds at scale; at 5K calls/month, you’ll spend more developer time maintaining the router than you save on tokens.
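For a rough break-even check, take the ~6-cent Opus task cost from the writing tests as a stand-in average (your real mix will differ): at 5K calls/month, routing saves roughly 5,000 × $0.06 × 0.42 ≈ $126 a month, which one hour of router debugging wipes out; at 100K calls/month the same math gives about $2,520 a month, enough to carry real maintenance.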
Quality-critical paths — anything legal, medical, or attached to a customer’s name — pin to the premium model. The failure mode of a cheap-model miss is more expensive than every token you’d save in a year.
Latency-critical paths under 500ms can’t afford the classifier call. Route by metadata instead — endpoint, user tier, request size — not by content.
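A sketch of what routing by metadata can look like: static rules over request attributes, no classifier call in the hot path. The endpoint names, tiers, and thresholds are hypothetical:

```python
# Static rules evaluated in order, first match wins; no classifier call in
# the hot path. Endpoint names, tiers, and thresholds are hypothetical.
METADATA_RULES = [
    (lambda req: req["endpoint"] == "/autocomplete", "llama-3-1-70b"),    # latency-critical
    (lambda req: req["user_tier"] == "enterprise",   "claude-opus-4-6"),  # quality pinned
    (lambda req: req["prompt_tokens"] > 20_000,      "gemini-3-1-pro"),   # long context
]

def route_by_metadata(req: dict) -> str:
    for matches, model in METADATA_RULES:
        if matches(req):
            return model
    return "llama-3-1-70b"  # cheap default
```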
Small teams: start with a two-model split. Premium for writing and reasoning, cheap for everything else. The 80% of savings come from the first split. The four-way router is a v2 problem.
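In router terms, the two-model split is the same ROUTING table with only two distinct models behind it; a sketch, with the assignments following the test results above:

```python
# v1 router: two models, same four buckets. Writing and reasoning go premium,
# everything else goes cheap; upgrade to the four-way table when volume justifies it.
TWO_MODEL_ROUTING = {
    "writing":    "claude-opus-4-6",
    "reasoning":  "claude-opus-4-6",  # GPT-5.2 works here too, per the reasoning results
    "extraction": "llama-3-1-70b",    # or gemini-3-1-pro if you lean on long PDFs
    "structured": "llama-3-1-70b",
}
```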
Monitoring is non-optional. Once you’re routing, you need per-model quality sampling — tools like LangSmith or Helicone handle this. Without it you’ll silently regress when a provider changes their model behind the same name. So now you know when to route, when to skip, and what to monitor — what’s the actual call to make Monday morning?
The Bottom Line
Running everything through a single model leaves real money or real quality on the table — the test proved it on 10 concrete tasks. The only question is how much engineering you can afford right now.
If you’re a small team: do the two-model split this week. Writing and reasoning to Claude or GPT-5.2. Everything else to Llama or Gemini. That’s the 80% win, and it takes an afternoon.
If you’re at production volume: build the four-bucket router with the quality escape hatch and per-model sampling. Budget two weeks.
If you’re below the volume threshold: stay on one model. Revisit when traffic 5x’s.
Pick your top three task types tonight. Run them through the four models tomorrow. Let the numbers — not the marketing — pick your router.