The benchmark is deliberately ordinary.
Almedra Timepieces is a one-page small-business website built with React and Vite. It is the kind of codebase people actually hand to coding agents: real layout, real data, real styling, and a feature request that requires state, UI, and restraint.
The task was to add a two-watch comparison feature to the collection section. Every agent received the same prompt, the same clean repo, the same GLM 5.1 model via OpenRouter, and its own fresh OpenRouter API key.
OpenRouter recorded the calls, tokens, cache reads, cost, and total time for each run.
The Result
Every agent completed the task and passed the same verification flow.
The benchmark keeps two questions separate: did the feature ship, and what did OpenRouter record for cost, tokens, calls, and time.
| Agent | Total Cost | Total Tokens | Model Calls | Total Time | Status |
|---|---|---|---|---|---|
| Pi (lowest cost) | $0.057915 | 110,882 | 12 | 1m 01s | Pass |
| Lucena Coder (fewest tokens) | $0.065284 | 62,160 | 9 | 1m 25s | Pass |
| OpenCode | $0.087063 | 216,003 | 15 | 1m 03s | Pass |
| Kilo Code | $0.129753 | 361,810 | 16 | 1m 25s | Pass |
| Copilot | $0.133825 | 393,339 | 19 | 1m 16s | Pass |
| Codex CLI | $0.145334 | 430,100 | 23 | 4m 01s | Pass |
Pi produced the lowest billed cost; Lucena used 62,160 total tokens across 9 model calls, the lowest token count and fewest calls in the field.
The Prompt
The prompt asked each agent to add the same practical website feature:
Add a two-watch comparison feature to the collection.
Requirements:
- Each watch in the collection has a compare control.
- Users can select up to two watches.
- If a third watch is selected, replace the oldest selected watch.
- When two watches are selected, show a comparison panel at the bottom of the collection section.
- The panel compares movement, case size, power reserve, water resistance, and price.
- The panel includes both selected watch names.
- Include a `Clear comparison` control.
- Use the existing `timepieces` data.
- Keep the existing editorial visual direction without adding image cards.
- Do not change unrelated sections of the page.
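The selection rule in these requirements (at most two watches, with a third pick replacing the oldest) can be sketched as a small pure function. This is a minimal illustration, not code from any agent's output: `toggleCompare`, `clearComparison`, and `MAX_COMPARED` are hypothetical names, and deselecting on a second click is an assumption the prompt does not spell out.

```typescript
// Hypothetical helper names; nothing here is taken from the benchmark repo.
const MAX_COMPARED = 2;

// Returns the next selection, ordered oldest first.
// Assumption: clicking an already-selected watch deselects it.
function toggleCompare(selected: readonly string[], id: string): string[] {
  if (selected.includes(id)) {
    return selected.filter((s) => s !== id);
  }
  const next = [...selected, id];
  // Drop from the front so the oldest selection is the one replaced.
  return next.length > MAX_COMPARED
    ? next.slice(next.length - MAX_COMPARED)
    : next;
}

// Backs the `Clear comparison` control.
function clearComparison(): string[] {
  return [];
}

// Replays three picks in a row.
let picks: string[] = [];
for (const id of ["Azahar 1874", "Turia Moonphase", "Serra GMT"]) {
  picks = toggleCompare(picks, id);
}
console.log(picks); // ["Turia Moonphase", "Serra GMT"]
```

Keeping the list ordered oldest first means the replacement rule is a single `slice`, and the comparison panel only needs to render when the list holds exactly two entries.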
What We Verified
Correctness came first. Cost only mattered after the run actually worked.
- Each mutated workspace built with `npm run build`.
- Each result was opened in Chrome through Playwright.
- The verifier selected Azahar 1874 and Turia Moonphase, then selected Serra GMT.
- The third selection had to replace the oldest selected watch.
- The panel had to include movement, case size, power reserve, water resistance, price, both watch names, and a clear control.
The Ledger
Each run had its own OpenRouter API key. The usage table below comes from OpenRouter analytics grouped by key.
| Agent | Input | Output | Reasoning | Cached | Cache Hit | Total Cost |
|---|---|---|---|---|---|---|
| Lucena | 55,897 | 6,263 | 1,195 | 19,776 | 35.38% | $0.065284 |
| Pi | 106,632 | 4,250 | 3,236 | 84,864 | 79.59% | $0.057915 |
| OpenCode | 212,576 | 3,427 | 2,264 | 178,496 | 83.97% | $0.087063 |
| Kilo Code | 356,407 | 5,403 | 4,240 | 325,312 | 91.28% | $0.129753 |
| Copilot | 387,933 | 5,406 | 5,331 | 359,168 | 92.59% | $0.133825 |
| Codex CLI | 425,240 | 4,860 | 4,170 | 386,240 | 90.83% | $0.145334 |
What This Test Tells Us
One row breaks the pattern. Pi, OpenCode, Kilo Code, Copilot, and Codex CLI all leaned heavily on provider cache, with cache hit rates from 79.59% to 92.59%. Lucena's cache hit rate was 35.38%, yet it still used the fewest total tokens and the fewest model calls.
That matters because caching and token-efficiency are not the same thing. Cache can make repeated context cheaper after a harness sends it. It does not mean the harness avoided sending the context in the first place.
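The ledger figures are consistent with two simple relationships: cache hit rate as cached tokens divided by input tokens, and total tokens as input plus output. The sketch below recomputes both from the published rows. The `LedgerRow` shape and function names are illustrative, and the formulas are inferred from the numbers rather than taken from OpenRouter documentation.

```typescript
// Illustrative shape; field names are assumptions, values come from the ledger table.
interface LedgerRow {
  agent: string;
  input: number;
  output: number;
  cached: number;
}

const ledger: LedgerRow[] = [
  { agent: "Lucena", input: 55_897, output: 6_263, cached: 19_776 },
  { agent: "Pi", input: 106_632, output: 4_250, cached: 84_864 },
  { agent: "OpenCode", input: 212_576, output: 3_427, cached: 178_496 },
  { agent: "Kilo Code", input: 356_407, output: 5_403, cached: 325_312 },
  { agent: "Copilot", input: 387_933, output: 5_406, cached: 359_168 },
  { agent: "Codex CLI", input: 425_240, output: 4_860, cached: 386_240 },
];

// Share of input tokens served from the provider cache, as a percentage.
function cacheHitRate(row: LedgerRow): number {
  return (row.cached / row.input) * 100;
}

// Total tokens moved, matching the "Total Tokens" column in the results table.
function totalTokens(row: LedgerRow): number {
  return row.input + row.output;
}

for (const row of ledger) {
  console.log(
    `${row.agent}: ${cacheHitRate(row).toFixed(2)}% cached, ${totalTokens(row)} total`
  );
}
// First line printed: "Lucena: 35.38% cached, 62160 total"
```

A high hit rate lowers the price of the input that was sent; it says nothing about how much input was sent, which is why the two columns have to be read together.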
Pi is the serious cost comparison. It finished at $0.057915, while Lucena finished at $0.065284. But Pi moved 110,882 total tokens to do it. Lucena moved 62,160.
The larger gap is in the working set. Lucena used 55,897 input tokens, 19,776 of them cached, and 6,263 output tokens, including 1,195 reasoning tokens. OpenCode used 216,003 total tokens. Kilo Code, Copilot, and Codex CLI all crossed 360,000.
That is why Lucena is the odd one out. The run was not cheap because a large context was mostly cached. It was efficient because the harness kept unnecessary context out of the agent call to begin with.
Cache is valuable, but the bigger win is never sending the unnecessary context at all: everything the agent needs, nothing it doesn't.
Time to Completion
The time column comes from OpenRouter generation timestamps, rounded up to whole seconds. For each dedicated key, OpenRouter records when the first model call started, when the last one finished, how many calls happened, how many tokens moved, how much cache was read, and cost.
That gives every agent the same clock for the provider calls. Local CLIs can add startup time, terminal rendering, package resolution, and human paste flow around the run; OpenRouter records the part every agent had to buy from the provider.
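As a rough sketch of how such a time figure could be derived, the function below takes the span from the earliest call start to the latest call finish and rounds up to whole seconds. The `Generation` shape and its field names are hypothetical, not OpenRouter's export format, and the timestamps in the example are invented for illustration.

```typescript
// Hypothetical record shape; real OpenRouter exports may name these fields differently.
interface Generation {
  startedAt: string; // ISO 8601 timestamp of when the model call started
  finishedAt: string; // ISO 8601 timestamp of when the model call finished
}

// Seconds from the first call's start to the last call's finish, rounded up.
function wallClockSeconds(generations: Generation[]): number {
  if (generations.length === 0) {
    throw new Error("no generations recorded for this key");
  }
  const first = Math.min(...generations.map((g) => Date.parse(g.startedAt)));
  const last = Math.max(...generations.map((g) => Date.parse(g.finishedAt)));
  return Math.ceil((last - first) / 1000);
}

// Formats seconds in the same "1m 01s" style as the results table.
function formatDuration(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}m ${seconds < 10 ? "0" + seconds : String(seconds)}s`;
}

// Invented timestamps, not data from the benchmark run.
const example: Generation[] = [
  { startedAt: "2025-01-01T00:00:00.000Z", finishedAt: "2025-01-01T00:00:05.200Z" },
  { startedAt: "2025-01-01T00:00:06.000Z", finishedAt: "2025-01-01T00:01:00.400Z" },
];
console.log(formatDuration(wallClockSeconds(example))); // "1m 01s"
```

Measuring first start to last finish includes any gaps between calls, so time the harness spends on local work between requests still counts toward the total.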
Run The Harness
The benchmark repo contains the clean Almedra fixture and the exact prompt used for this run. Open the fixture, run the prompt with another coding agent, then compare the OpenRouter usage.
Benchmark Repo
Clean app workspace, one prompt file, and a reproducible starting point for coding-agent efficiency tests.