
Cost-and-Speed Budgets (v0.15.1+)

The eval harness uses cost as the primary execution gate, not wall-clock time. This lets cheap-but-slow models (such as the open-source SOTA models on OpenRouter) get a fair evaluation while still preventing expensive models from running away on spend.

Why cost > time

Before v0.15.1, the harness enforced wall-clock timeouts (60s for opencode, 90-180s at the benchmark level) as a proxy for cost control. This systematically excluded cheap-but-slow models (DeepSeek V4 Flash, Kimi K2.6, GLM 4.7 Flash), even though they cost fractions of a cent per benchmark.

Expensive models, meanwhile, were barely constrained by wall-clock limits at all: each call was fast, just costly (opencode-sonnet-4-6 at ~$0.20/call × multi-turn came to $13.38 across 68 runs).

Cost is the right dimension. v0.15.1 makes it explicit.

How it works

Each *executor.Task carries a *CostBudget (nil when no budget applies). The five executors (claude/opencode/gemini/codex/pi) call budget.Add(input, output) at their natural token-tally event point:

  • claude: stream-json usage event
  • opencode: stream-json tool_result / step_finish event
  • gemini: post-hoc at the result event (the Gemini CLI emits no incremental usage)
  • codex: turn boundary
  • pi: per-turn message_end event

When the running cost exceeds MaxCostUSD, the executor sets Result.CostKilledAt to the running cost at that point, stops reading the stream, and returns early. In the dashboard this is a distinct failure category from api_error/timeout/logic_error.
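The shared pattern is small enough to sketch in Go. Only Task.Budget, CostBudget.Add(input, output), MaxCostUSD, and Result.CostKilledAt come from the text above; the pricing fields, the usage event type, and the consume helper are illustrative assumptions, not the harness's actual code:

type CostBudget struct {
    MaxCostUSD  float64 // resolved ceiling; 0 disables enforcement
    InputPer1K  float64 // from pricing.input_per_1k (assumed field name)
    OutputPer1K float64 // from pricing.output_per_1k (assumed field name)
    running     float64 // running cost in USD (assumed internal)
}

// Add tallies one usage event and reports the running cost and whether
// the ceiling was crossed. MaxCostUSD == 0 never trips (see "Disabling
// cost enforcement" below).
func (b *CostBudget) Add(input, output int) (cost float64, exceeded bool) {
    b.running += float64(input)/1000*b.InputPer1K + float64(output)/1000*b.OutputPer1K
    return b.running, b.MaxCostUSD > 0 && b.running > b.MaxCostUSD
}

type Task struct{ Budget *CostBudget }
type Result struct{ CostKilledAt float64 }
type usage struct{ input, output int }

// consume drains an executor's event stream until it ends or the budget trips.
func consume(t *Task, events <-chan usage) Result {
    var res Result
    for ev := range events {
        if t.Budget == nil {
            continue // no budget attached: nothing to enforce
        }
        if cost, exceeded := t.Budget.Add(ev.input, ev.output); exceeded {
            res.CostKilledAt = cost // surfaces as its own failure category
            break                   // stop reading the stream; return early
        }
    }
    return res
}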

Per-model budgets: the budgets: block

Override the defaults per model in models.yml:

opencode-or-minimax-m2-7:
  pricing:
    input_per_1k: 0.0003
    output_per_1k: 0.0012
  budgets:
    max_cost_usd: 0.30             # generous cost ceiling: 30¢ for a cheap model
    hard_timeout_secs: 600         # 10 min wall-clock safety net
    expected_ttft_secs: 30         # for "abnormally slow" alerts (advisory)
    expected_ttf_solution_secs: 90

When budgets: is omitted, defaults apply:

max_cost_usd = min($0.50, input_per_1k × 64 + output_per_1k × 32)
hard_timeout_secs = 600

Examples of resolved defaults:

  • or-minimax-m2-7 ($0.30 / $1.20 per 1M): $0.058 (formula)
  • or-glm-5 ($0.60 / $2.08 per 1M): $0.105 (formula)
  • claude-opus-4-7 ($15 / $75 per 1M): $0.50 (clipped to ceiling)
  • ollama-codellama (free, $0 / $0): $0 (no enforcement)
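The resolution itself is a one-liner. A sketch in Go (the helper name resolveDefaultBudget is an assumption; the formula and the $0.50 ceiling are documented above):

func resolveDefaultBudget(inputPer1K, outputPer1K float64) float64 {
    cost := inputPer1K*64 + outputPer1K*32
    if cost > 0.50 {
        return 0.50 // expensive models clip to the ceiling
    }
    return cost // free models resolve to 0, which disables enforcement
}

// resolveDefaultBudget(0.0003, 0.0012) == 0.0576  // or-minimax-m2-7 row: $0.058
// resolveDefaultBudget(0.015, 0.075)   == 0.50    // claude-opus-4-7 row: clipped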

Speed metrics in Result

The eval harness now records speed observables alongside cost:

  • FirstAttemptMs: ms from task start to the first solution submission (first Write/Edit tool call, or first text fallback)
  • SuccessAtMs: ms from task start to a passing solution (-1 if not measured)
  • Turns: agent-mode turn count (already existed, now top-level)
  • TokensPerSec: OutputTokens / generation_seconds
  • CostKilledAt: running cost at kill time (0 if not killed)
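In Go terms the new fields look roughly like this (a sketch; the exact types are assumptions and the pre-existing Result fields are elided):

type Result struct {
    FirstAttemptMs int64   // ms to first Write/Edit tool call or text fallback
    SuccessAtMs    int64   // ms to passing solution; -1 if not measured
    Turns          int     // agent-mode turn count, now top-level
    TokensPerSec   float64 // OutputTokens / generation_seconds
    CostKilledAt   float64 // running cost in USD at kill time; 0 if not killed
    // ...existing fields elided
}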

These flow into the dashboard's efficiency block per model:

"efficiency": {
"median_time_to_first_attempt_ms": 8400,
"median_time_to_success_ms": 42000,
"median_turns_to_success": 3,
"median_tokens_per_sec": 45.2,
"p90_cost_per_success": 0.18,
"speed_efficiency_score": 0.73,
"cost_killed_count": 0
}

speed_efficiency_score = success_rate / (1 + median_TTS_seconds / 60) — a [0..1] score the dashboard uses to rank models on the cost-vs-speed Pareto frontier.
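As a sketch (the function name is an assumption; the formula is the one above, taking the median TTS in milliseconds as stored in the efficiency block):

func speedEfficiencyScore(successRate, medianTTSMs float64) float64 {
    // success_rate / (1 + median_TTS_seconds / 60)
    return successRate / (1 + medianTTSMs/1000/60)
}

Because success_rate is at most 1 and the denominator is at least 1, the score stays in [0..1]: fast, reliable models approach their raw success rate, while slow ones are discounted.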

Dashboard charts

Two new charts visualize these metrics:

  • Speed Radar — median time-to-success per model, outlier-clipped at 5× median (mirroring the v0.15.0 cost radar pattern)
  • Cost-Speed Frontier — Pareto scatter (x = $/success log scale, y = sec/success log scale, marker size = success rate, color = harness). Dominated points highlighted; frontier line drawn through optimal models.

The existing PerModelTrend chart gains a metric dropdown: Success Rate / Time to Success / Cost per Success.

Migration guide

For existing baselines (v0.14.x, v0.15.0):

  • Pre-v0.15.1 result JSONs lack cost_killed_at/first_attempt_ms/success_at_ms/tokens_per_sec fields → all default to 0
  • median_time_to_success_ms falls back to duration_ms for these baselines (ComputeEfficiency uses DurationMs when SuccessAtMs <= 0; see the sketch after this list)
  • Dashboard Speed Radar will show realistic data for v0.15.1+ baselines; pre-v0.15.1 entries appear with full DurationMs (post-hoc) instead of true TTS
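A sketch of that fallback (the helper name is an assumption; only the SuccessAtMs <= 0 rule is documented):

func effectiveTTSMs(successAtMs, durationMs int64) int64 {
    if successAtMs <= 0 {
        return durationMs // pre-v0.15.1 baselines: whole-run duration stands in
    }
    return successAtMs
}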

For new evaluations:

  • No action needed — the eval harness automatically populates Task.Budget from each model's resolved MaxCostUSD
  • Pass --agent-parallel N as before; budgets enforce per-Task, parallelism unchanged
  • Watch for cost_killed rows in the Top Error Codes table after large baseline runs

Disabling cost enforcement

To run with legacy wall-clock-only behaviour (e.g., debugging a stuck eval):

some-model:
  budgets:
    max_cost_usd: 0          # disable enforcement; passive token tally only
    hard_timeout_secs: 60    # restore the original 60s ceiling

MaxCostUSD == 0 makes CostBudget.Add() always return exceeded=false. Useful for back-compat testing and replay verification.
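A quick illustration of the passive tally, reusing the CostBudget sketch from "How it works" above (the token counts are arbitrary):

// With max_cost_usd: 0 the tally still accumulates but never trips.
b := &CostBudget{MaxCostUSD: 0, InputPer1K: 0.0003, OutputPer1K: 0.0012}
cost, exceeded := b.Add(500_000, 200_000) // one huge turn
fmt.Println(cost, exceeded)               // ~$0.39, exceeded=false: tally only, no kill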

See also