Back to notes

Benchmark / Token Efficiency

The Almedra Token-Efficiency Benchmark

Six coding agents. Same prompt. Same repo. Same OpenRouter model.

Lowest billed cost Pi
Input
106,632
Output
4,250
Cached
84,864
Total Cost
$0.057915
All agents used GLM 5.1
Provider
OpenRouter
Prompt
See Below
Repo
Agent-Efficiency-Bench

The benchmark is deliberately ordinary.

Almedra Timepieces is a one-page small-business website built with React and Vite. It is the kind of codebase people actually hand to coding agents: real layout, real data, real styling, and a feature request that requires state, UI, and restraint.

The task was to add a two-watch comparison feature to the collection section. Every agent received the same prompt, the same clean repo, the same GLM 5.1 model via OpenRouter, and its own fresh OpenRouter API key.

Each agent ran on a dedicated OpenRouter key.

OpenRouter recorded the calls, tokens, cache reads, cost, and total time for each run.

The Result

Every agent completed the task and passed the same verification flow.

The benchmark keeps two questions separate: did the feature ship, and what did OpenRouter record for cost, tokens, calls, and time.

Agent Total Cost Total Tokens Model Calls Total Time Status
Pilowest cost $0.057915 110,882 12 1m 01s Pass
Lucena Coderfewest tokens $0.065284 62,160 9 1m 25s Pass
OpenCode $0.087063 216,003 15 1m 03s Pass
Kilo Code $0.129753 361,810 16 1m 25s Pass
Copilot $0.133825 393,339 19 1m 16s Pass
Codex CLI $0.145334 430,100 23 4m 01s Pass

Pi produced the lowest billed cost; Lucena used 62,160 total tokens across 9 model calls, the lowest token count and fewest calls in the field.

Total Tokens Used

Lucena wins on token-efficiency.

Lucena
62,160
Pi
110,882
OpenCode
216,003
Kilo Code
361,810
Copilot
393,339
Codex CLI
430,100

The Prompt

The prompt asked each agent to add the same practical website feature:

Add a two-watch comparison feature to the collection.

Requirements:

- Each watch in the collection has a compare control.
- Users can select up to two watches.
- If a third watch is selected, replace the oldest selected watch.
- When two watches are selected, show a comparison panel at the bottom of the collection section.
- The panel compares movement, case size, power reserve, water resistance, and price.
- The panel includes both selected watch names.
- Include a `Clear comparison` control.
- Use the existing `timepieces` data.
- Keep the existing editorial visual direction without adding image cards.
- Do not change unrelated sections of the page.

What We Verified

Correctness came first. Cost only mattered after the run actually worked.

The Ledger

Each run had its own OpenRouter API key. The usage table below comes from OpenRouter analytics grouped by key.

Agent Input Output Reasoning Cached Cache Hit Total Cost
Lucena 55,897 6,263 1,195 19,776 35.38% $0.065284
Pi 106,632 4,250 3,236 84,864 79.59% $0.057915
OpenCode 212,576 3,427 2,264 178,496 83.97% $0.087063
Kilo Code 356,407 5,403 4,240 325,312 91.28% $0.129753
Copilot 387,933 5,406 5,331 359,168 92.59% $0.133825
Codex CLI 425,240 4,860 4,170 386,240 90.83% $0.145334

What This Test Tells Us

One row breaks the pattern. Pi, OpenCode, Kilo Code, Copilot, and Codex CLI all leaned heavily on provider cache, with cache hit rates from 79.59% to 92.59%. Lucena's cache hit rate was 35.38%, yet it still used the fewest total tokens and the fewest model calls.

That matters because caching and token-efficiency are not the same thing. Cache can make repeated context cheaper after a harness sends it. It does not mean the harness avoided sending the context in the first place.

Pi is the serious cost comparison. It finished at $0.057915, while Lucena finished at $0.065284. But Pi moved 110,882 total tokens to do it. Lucena moved 62,160.

The larger gap is in the working set. Lucena used 55,897 input tokens, 6,263 output tokens, 1,195 reasoning tokens, and 19,776 cached tokens. OpenCode used 216,003 total tokens. Kilo Code, Copilot, and Codex CLI all crossed 360,000.

That is why Lucena is the odd one out. The run was not cheap because a large context was mostly cached. It was efficient because the harness kept less unnecessary context out of the agent call to begin with.

Cache is valuable, but the bigger win is making less unnecessary context exist in the agent call to begin with. Everything they need, nothing they don't.

Time to Completion

The time column comes from OpenRouter generation timestamps, rounded up to whole seconds. For each dedicated key, OpenRouter records when the first model call started, when the last one finished, how many calls happened, how many tokens moved, how much cache was read, and cost.

That gives every agent the same clock for the provider calls. Local CLIs can add startup time, terminal rendering, package resolution, and human paste flow around the run; OpenRouter records the part every agent had to buy from the provider.

Run The Harness

The benchmark repo contains the clean Almedra fixture and the exact prompt used for this run. Open the fixture, run the prompt with another coding agent, then compare the OpenRouter usage.

Benchmark Repo

Clean app workspace, one prompt file, and a reproducible starting point for coding-agent efficiency tests.

View the Benchmark Repo