Every article comparing Claude Code vs Codex CLI vs Gemini CLI quotes the same suspicious benchmarks. One of them is literally a 404 right now — I clicked it twice to be sure.
So I stopped reading and ran my own. Same Express.js codebase. Same refactor request. Same prompt, pasted into a fresh terminal session for each. Three AI terminal coding agents that reviewers keep ranking in 2026, one piece of real work, no edits to the prompt between runs.
Here’s what the timer said, what each tool quietly broke, and what the whole thing actually cost.
The Test: One Express.js App, Three Terminal Agents, Same Prompt
In a Claude Code vs Codex CLI vs Gemini CLI head-to-head, the only fair test is identical input. Claude Code finished the refactor fastest with zero human nudges. Gemini CLI cost the least at six cents. Codex CLI landed in the middle — slower to plan but stable on autonomous loops. Here’s how each one got there.
The starting point was a small Express.js API — about 600 lines, 8 routes, callback-style, no tests. The kind of code that sits in a sidebar of someone’s repo and quietly works until it doesn’t.
The prompt was four sentences: convert callbacks to async/await, add input validation with zod, write Jest tests for every route, keep behavior identical. That’s it. No nudging, no hand-holding, no “remember to also…” I pasted the same text into each tool with no edits.
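If you haven’t done this exact conversion before, here’s the shape of the change the prompt is asking for. The route below is illustrative, not one lifted from the test repo, and it assumes the data layer exposes a promise-based variant:

```js
// Before: callback-style route, errors forwarded via next(err)
app.get('/users/:id', (req, res, next) => {
  db.getUser(req.params.id, (err, user) => {
    if (err) return next(err);
    if (!user) return res.status(404).json({ error: 'not found' });
    res.json(user);
  });
});

// After: async/await plus a zod check on the params, same responses as before
const { z } = require('zod');
const paramsSchema = z.object({ id: z.string().min(1) });

app.get('/users/:id', async (req, res, next) => {
  try {
    const { id } = paramsSchema.parse(req.params); // throws on invalid input
    const user = await db.getUser(id);             // assumes a promisified data layer
    if (!user) return res.status(404).json({ error: 'not found' });
    res.json(user);
  } catch (err) {
    next(err); // keep the existing Express error middleware in the loop
  }
});
```

Multiply that by eight routes, plus a Jest file per route, and that’s the whole job.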
Rules I set for myself: no interventions unless the agent got fully stuck. Log every nudge. Capture token usage from each provider’s dashboard the moment the run ended. Versions used: Claude Code (latest), Codex CLI on gpt-5.2, Gemini CLI on gemini-3-flash. Each ran in its own fresh git clone to keep the state clean.
This isn’t an IDE comparison — it’s a test of terminal-based AI coding agents, which is a different kind of work. CLI agents have to navigate the codebase themselves; nothing is preloaded for them.
If the conditions are identical, the timing differences should be embarrassing for someone. They were.
Timing Results: Who Finished First (and Who Got Stuck)
Claude Code finished in 1 hour 9 minutes with zero nudges. It ran the Jest suite between edits and self-corrected two test failures before I could even react. The headline number is the time, but the more interesting one is the zero — I never had to step in.
Codex CLI took 1 hour 38 minutes with one nudge. It got stuck looping on a zod schema for a nested request body, trying variations of the same approach for about six minutes before I told it to flatten the validator and move on. Once it had a plan locked in, it was fast.
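For anyone who hasn’t hit this particular failure mode, here’s roughly what a nested body validator looks like and one way to read the “flatten it” nudge. The schema and field names are illustrative, not pulled from the run:

```js
const { z } = require('zod');

// The kind of deeply nested body schema that invites endless rewrites (field names are made up)
const createOrder = z.object({
  customer: z.object({
    id: z.string(),
    shipping: z.object({ street: z.string(), zip: z.string().length(5) }),
  }),
  items: z.array(z.object({ sku: z.string(), qty: z.number().int().positive() })),
});

// The flattened version: validate the top-level shape strictly, let the deep fields through
const createOrderFlat = z.object({
  customer: z.object({ id: z.string() }).passthrough(), // unknown nested keys pass untouched
  items: z.array(z.unknown()).nonempty(),
});
```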
Gemini CLI took 2 hours 11 minutes with three nudges. It lost track of file paths twice. It regenerated package.json from scratch once because it convinced itself the existing one was broken (it wasn’t). Most of those extra minutes were spent re-reading files it had already read in the same session.
The real lesson isn’t the rank order. It’s that Claude Code’s edge wasn’t raw speed — it was the test-edit-test loop running automatically between rewrites. That’s a structural advantage, not a model advantage. Codex sat in a respectable middle, slow to plan but fast to execute. Gemini’s slowness came from a thin planning step that left it constantly relitigating decisions it had already made.
But finishing fast doesn’t matter if the diff is broken. So I read the diff.
What Each Tool Quietly Broke
Every comparison article tells you what these tools can do. None of them tell you what they break. Here’s what I found.
Claude Code silently swallowed the original next(err) error-passing pattern. It replaced the callbacks with try/catch blocks but forgot to call the Express error middleware in two routes. The Jest tests still passed because the tests didn’t cover error paths — and the new code was structurally cleaner, so I almost missed it. Subtle failures hide in places your tests don’t look.
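The pattern, reconstructed rather than copied from the diff (the route and orders.create are stand-ins), looks like this:

```js
// What the two broken routes amountedted to: the catch block answers the request itself,
// so the shared error middleware (logging, error envelope) never runs
app.post('/orders', async (req, res) => {
  try {
    res.status(201).json(await orders.create(req.body));
  } catch (err) {
    res.status(500).json({ error: 'internal error' }); // swallowed right here
  }
});

// What preserving the original next(err) contract looks like
app.post('/orders', async (req, res, next) => {
  try {
    res.status(201).json(await orders.create(req.body));
  } catch (err) {
    next(err); // hand off to app.use((err, req, res, next) => ...), same as the callbacks did
  }
});
```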
Codex CLI wrote technically correct zod validators, then moved required-field checks from the middleware into individual route handlers. The validation logic was right. The error response shape changed — what used to return a 400 with a specific error envelope now returned zod’s default formatted error. Any client depending on the original payload would silently break in production.
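Simplified, with a made-up envelope shape standing in for the real one, the drift looks like this:

```js
// Before: the validation middleware owned failures and returned a stable envelope
return res.status(400).json({ error: { code: 'VALIDATION', message: 'email is required' } });

// After: each handler returns zod's own issue list instead
const result = bodySchema.safeParse(req.body);
if (!result.success) {
  return res.status(400).json(result.error.issues); // same status code, different payload shape
}
```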
Gemini CLI regenerated package.json and dropped three dev dependencies — eslint, prettier, husky — because it considered them “unused.” They’re not unused. They’re loaded by config files Gemini didn’t read. The whole thing only surfaced when CI failed two days later.
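If it’s not obvious why a quick scan calls them unused: nothing in src/ ever requires these packages. They live entirely at the tooling boundary, roughly like this (illustrative config, not the repo’s):

```js
// .eslintrc.js — eslint is never imported by application code; it's only reached
// through this config and an "npm run lint" script
module.exports = {
  root: true,
  env: { node: true, jest: true },
  parserOptions: { ecmaVersion: 2022 },
  extends: ['eslint:recommended'],
  rules: { 'no-unused-vars': 'warn' },
};
// prettier (.prettierrc) and husky (.husky/ hooks plus a "prepare" script) are wired the same way:
// visible only to the tooling, invisible to an agent that only reads route files.
```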
The pattern that emerged is uncomfortably clean: Claude broke things where tests didn’t look. Codex broke API contracts. Gemini broke tooling boundaries. Three different failure modes, all invisible to the tools’ own self-checks. The same pattern shows up across web-based autonomous coding agents too. Reading the diff catches all three. Trusting the green checkmark catches none of them.
Which makes value-per-dollar the next real question — and the cost numbers are not what I expected.
The Token Cost Nobody Calculates
Claude Code burned $0.41 in API spend for the whole refactor. Mostly input tokens — it re-read files often, but it did it efficiently and edited surgically. At these levels, whether cost optimization matters at all is a question of volume, not per-run spend.
Codex CLI cost $0.19. More output tokens than Claude (it rewrote bigger chunks at once), but the per-million pricing on gpt-5.2 is cheap enough that it still came in at under half of Claude’s spend. The Codex CLI review headlines that talk about “stability and price” actually hold up here.
Gemini CLI cost $0.06. Six cents. Gemini 3 Flash pricing plus Google’s free tier covered most of the run. I checked the dashboard twice because I didn’t believe it.
That’s a 7x gap between the priciest and cheapest tool for a refactor that needed roughly the same human review afterward. If you’re running agents at scale, the cost optimization strategies worth pursuing are the ones that reduce input token churn, not output volume. If you’re going to read the diff carefully anyway — and you should, since your own token counting never quite matches the provider’s bill — Gemini’s six-cent run is hard to argue against. The time you save not babysitting Claude isn’t worth $0.35 unless your hourly rate is high.
The math flips for autonomous overnight loops, where Codex’s loop stability becomes the deciding factor. Of the three, Codex is the one I’d trust to run unattended.
The Verdict: Pick Based on What You’re Actually Doing
Use Claude Code when the work is hands-on, the codebase is unfamiliar, and you need clean code on the first try. The price is worth it for the tight feedback loop. The competitive landscape shifts in a broader IDE comparison, but this verdict is about the terminal.
Use Codex CLI when you want to walk away and let it run. It’s the most stable on autonomous loops, and its failure mode — API contract drift — is the easiest to catch in code review.
Use Gemini CLI when the task is well-scoped, you’ll review the diff anyway, and cost matters. The six-cent refactor is real, and the speed penalty is mostly planning, not output.
If you’re searching for the best CLI AI coding tool for your workflow, this Claude Code vs Codex CLI vs Gemini CLI comparison comes down to one question: are you babysitting the agent or letting it run overnight?
Personally? Claude Code for new code. Gemini CLI for edits to code I already understand. Codex CLI for overnight chores.
Now clone a small project of your own and paste your real prompt into whichever two you’re choosing between. Watch the diff. Every benchmark article is downstream of someone’s bias. Including this one.