The Almedra Token-Efficiency Benchmark

The benchmark is deliberately ordinary.

Almedra Timepieces is a one-page small-business website built with React and Vite. It is the kind of codebase people actually hand to coding agents: real layout, real data, real styling, and a feature request that requires state, UI, and restraint.

The task was to add a two-watch comparison feature to the collection section. Every agent received the same prompt, the same clean repo, the same GLM 5.1 model via OpenRouter, and its own fresh OpenRouter API key.


OpenRouter recorded the calls, tokens, cache reads, cost, and total time for each run.

The Result

Every agent completed the task and passed the same verification flow.

The benchmark keeps two questions separate: did the feature ship, and what did OpenRouter record for cost, tokens, calls, and time?

Agent	Total Cost	Total Tokens	Model Calls	Total Time	Status
Pi (lowest cost)	$0.057915	110,882	12	1m 01s	Pass
Lucena Coder (fewest tokens)	$0.065284	62,160	9	1m 25s	Pass
OpenCode	$0.087063	216,003	15	1m 03s	Pass
Kilo Code	$0.129753	361,810	16	1m 25s	Pass
Copilot	$0.133825	393,339	19	1m 16s	Pass
Codex CLI	$0.145334	430,100	23	4m 01s	Pass

Pi produced the lowest billed cost; Lucena used 62,160 total tokens across 9 model calls, the lowest token count and fewest calls in the field.

Chart: total tokens by agent. Lucena 62,160; Pi 110,882; OpenCode 216,003; Kilo Code 361,810; Copilot 393,339; Codex CLI 430,100.

The Prompt

The prompt asked each agent to add the same practical website feature:

Add a two-watch comparison feature to the collection.

Requirements:

- Each watch in the collection has a compare control.
- Users can select up to two watches.
- If a third watch is selected, replace the oldest selected watch.
- When two watches are selected, show a comparison panel at the bottom of the collection section.
- The panel compares movement, case size, power reserve, water resistance, and price.
- The panel includes both selected watch names.
- Include a `Clear comparison` control.
- Use the existing `timepieces` data.
- Keep the existing editorial visual direction without adding image cards.
- Do not change unrelated sections of the page.
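
The "replace the oldest selected watch" rule is the one piece of state logic in the prompt. A minimal sketch of it as a pure function follows. The function name, the id strings, and the choice to deselect a watch when it is picked a second time are assumptions for illustration; none of them come from the benchmark repo or from any agent's output.

```typescript
// Hypothetical helper: names and ids are illustrative only.
type WatchId = string;

const MAX_COMPARED = 2;

// Returns the next selection, ordered oldest first.
// Assumption: picking an already-selected watch deselects it
// (the prompt does not say what should happen in that case).
function toggleCompare(selected: readonly WatchId[], id: WatchId): WatchId[] {
  if (selected.indexOf(id) !== -1) {
    return selected.filter((s) => s !== id);
  }
  const next = selected.concat(id);
  // Keep only the newest MAX_COMPARED entries, so the oldest one is replaced.
  return next.slice(-MAX_COMPARED);
}

let picked: WatchId[] = [];
for (const id of ["azahar-1874", "turia-moonphase", "serra-gmt"]) {
  picked = toggleCompare(picked, id);
}
console.log(picked); // ["turia-moonphase", "serra-gmt"]
```

In a React component the returned array would typically live in state, with the comparison panel rendered only when it holds two entries.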

What We Verified

Correctness came first. Cost only mattered after the run actually worked.

- Each mutated workspace built with `npm run build`.
- Each result was opened in Chrome through Playwright.
- The verifier selected Azahar 1874 and Turia Moonphase, then selected Serra GMT.
- The third selection had to replace the oldest selected watch.
- The panel had to include movement, case size, power reserve, water resistance, price, both watch names, and a clear control.
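
The panel check above can be sketched as a plain text comparison. The verifier's actual code is not shown here, so this is only an approximation: the function name, the exact label strings, and the idea of checking the panel's visible text are all assumptions.

```typescript
// Hypothetical check: label wording is assumed, not taken from the verifier.
const REQUIRED_LABELS: string[] = [
  "Movement",
  "Case size",
  "Power reserve",
  "Water resistance",
  "Price",
  "Clear comparison",
];

// Returns every required string that does not appear in the panel text.
// Matching is case-insensitive so styling such as uppercase labels passes.
function missingFromPanel(panelText: string, watchNames: string[]): string[] {
  const haystack = panelText.toLowerCase();
  return REQUIRED_LABELS.concat(watchNames).filter(
    (needle) => haystack.indexOf(needle.toLowerCase()) === -1
  );
}

// After the third selection the panel should name the two newest watches.
const samplePanel =
  "Turia Moonphase vs Serra GMT Movement Case size Power reserve " +
  "Water resistance Price Clear comparison";
console.log(missingFromPanel(samplePanel, ["Turia Moonphase", "Serra GMT"])); // []
```

An empty result means every required item was found; if Azahar 1874 were still in the panel in place of Serra GMT, the missing name would show up in the returned list.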

The Ledger

Each run had its own OpenRouter API key. The usage table below comes from OpenRouter analytics grouped by key.

Agent	Input	Output	Reasoning	Cached	Cache Hit	Total Cost
Lucena	55,897	6,263	1,195	19,776	35.38%	$0.065284
Pi	106,632	4,250	3,236	84,864	79.59%	$0.057915
OpenCode	212,576	3,427	2,264	178,496	83.97%	$0.087063
Kilo Code	356,407	5,403	4,240	325,312	91.28%	$0.129753
Copilot	387,933	5,406	5,331	359,168	92.59%	$0.133825
Codex CLI	425,240	4,860	4,170	386,240	90.83%	$0.145334
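
The two derived figures in this article can be reproduced from the ledger. In the sketch below, total tokens are taken as input plus output and cache hit rate as cached divided by input. Those two formulas are inferred from the numbers matching, not from OpenRouter documentation, and the field names are illustrative.

```typescript
// Figures copied from the ledger table; field names are illustrative.
interface LedgerRow {
  agent: string;
  input: number;
  output: number;
  cached: number;
}

const ledger: LedgerRow[] = [
  { agent: "Lucena", input: 55897, output: 6263, cached: 19776 },
  { agent: "Pi", input: 106632, output: 4250, cached: 84864 },
  { agent: "OpenCode", input: 212576, output: 3427, cached: 178496 },
  { agent: "Kilo Code", input: 356407, output: 5403, cached: 325312 },
  { agent: "Copilot", input: 387933, output: 5406, cached: 359168 },
  { agent: "Codex CLI", input: 425240, output: 4860, cached: 386240 },
];

// Assumed relationship: total = input + output.
function totalTokens(row: LedgerRow): number {
  return row.input + row.output;
}

// Assumed relationship: hit rate = cached / input, as a percentage.
function cacheHitPercent(row: LedgerRow): number {
  return (row.cached / row.input) * 100;
}

for (const row of ledger) {
  console.log(
    `${row.agent}: ${totalTokens(row)} total, ` +
      `${cacheHitPercent(row).toFixed(2)}% cache hit`
  );
}
```

Under these assumptions every row matches the published totals and hit rates, which also suggests that reasoning tokens are counted inside output and cached tokens inside input.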

What This Test Tells Us

One row breaks the pattern. Pi, OpenCode, Kilo Code, Copilot, and Codex CLI all leaned heavily on provider cache, with cache hit rates from 79.59% to 92.59%. Lucena's cache hit rate was 35.38%, yet it still used the fewest total tokens and the fewest model calls.

That matters because caching and token-efficiency are not the same thing. Cache can make repeated context cheaper after a harness sends it. It does not mean the harness avoided sending the context in the first place.
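
One way to make the distinction concrete is to split each agent's input into cached and uncached parts. The sketch below does that with the ledger figures, assuming the Cached column is a subset of the Input column. It says nothing about price, since per-token rates are not listed here.

```typescript
// [agent, input tokens, cached tokens], copied from the ledger table.
const inputRows: Array<[string, number, number]> = [
  ["Lucena", 55897, 19776],
  ["Pi", 106632, 84864],
  ["OpenCode", 212576, 178496],
  ["Kilo Code", 356407, 325312],
  ["Copilot", 387933, 359168],
  ["Codex CLI", 425240, 386240],
];

// Assumption: cached tokens are counted inside input tokens.
function uncachedInput(input: number, cached: number): number {
  return input - cached;
}

for (const row of inputRows) {
  const agent = row[0];
  const input = row[1];
  const cached = row[2];
  console.log(`${agent}: ${uncachedInput(input, cached)} uncached of ${input} input`);
}
```

On these figures Pi sent 21,768 uncached input tokens against Lucena's 36,121 while sending almost twice as much input overall. If cache reads are billed at a discount, that would be consistent with Pi's lower bill despite its larger total.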

Pi is the serious cost comparison. It finished at $0.057915, while Lucena finished at $0.065284. But Pi moved 110,882 total tokens to do it. Lucena moved 62,160.

The larger gap is in the working set. Lucena used 55,897 input tokens, 6,263 output tokens, 1,195 reasoning tokens, and 19,776 cached tokens. OpenCode used 216,003 total tokens, roughly 3.5 times Lucena's total. Kilo Code, Copilot, and Codex CLI all crossed 360,000.

That is why Lucena is the odd one out. The run was not cheap because a large context was mostly cached. It was efficient because the harness kept unnecessary context out of the agent call to begin with.

Cache is valuable, but the bigger win is never putting unnecessary context into the agent call at all. Everything the agent needs, nothing it doesn't.

Time to Completion

The time column comes from OpenRouter generation timestamps, rounded up to whole seconds. For each dedicated key, OpenRouter records when the first model call started, when the last one finished, how many calls happened, how many tokens moved, how much cache was read, and cost.
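
The timing rule described above (first call start to last call finish, rounded up to whole seconds) can be sketched as follows. The record shape and field names are hypothetical, since OpenRouter's export format is not shown here, and the sample timestamps are made up for illustration rather than taken from any run.

```typescript
// Hypothetical record shape: one entry per model call, ISO 8601 timestamps.
interface Generation {
  startedAt: string;
  finishedAt: string;
}

// Elapsed time from the earliest start to the latest finish, rounded up.
function elapsedSeconds(gens: Generation[]): number {
  const starts = gens.map((g) => Date.parse(g.startedAt));
  const ends = gens.map((g) => Date.parse(g.finishedAt));
  const ms = Math.max.apply(null, ends) - Math.min.apply(null, starts);
  return Math.ceil(ms / 1000);
}

// Formats whole seconds in the table's style, for example "1m 01s".
function formatDuration(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}m ${seconds < 10 ? "0" : ""}${seconds}s`;
}

// Illustrative timestamps only, not data from the benchmark.
const sampleCalls: Generation[] = [
  { startedAt: "2025-01-01T00:00:00.000Z", finishedAt: "2025-01-01T00:00:20.000Z" },
  { startedAt: "2025-01-01T00:00:21.000Z", finishedAt: "2025-01-01T00:01:00.200Z" },
];
console.log(formatDuration(elapsedSeconds(sampleCalls))); // "1m 01s"
```

Rounding up means a run that lasts 60.2 seconds is reported as 1m 01s, so no agent is credited with less time than it used.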

That gives every agent the same clock for the provider calls. Local CLIs can add startup time, terminal rendering, package resolution, and human paste flow around the run; OpenRouter records the part every agent had to buy from the provider.

Run The Harness

The benchmark repo contains the clean Almedra fixture and the exact prompt used for this run. Open the fixture, run the prompt with another coding agent, then compare the OpenRouter usage.

Benchmark Repo

Clean app workspace, one prompt file, and a reproducible starting point for coding-agent efficiency tests.
