Cost-and-Speed Budgets (v0.15.1+)
The eval harness uses cost as the primary execution gate, not wall-clock time. This lets cheap-but-slow models (like the open-source SOTA models on OpenRouter) get a fair evaluation, while still preventing expensive models from running away.
Why cost > time
Before v0.15.1, the harness enforced wall-clock timeouts (60s opencode, 90-180s benchmark-level) as a proxy for cost control. This systematically excluded cheap-but-slow models — DeepSeek V4 Flash, Kimi K2.6, GLM 4.7 Flash — even though they cost fractions of a cent per benchmark.
Meanwhile, expensive models (opencode-sonnet-4-6 at ~$0.20/call × multi-turn = $13.38 across 68 runs) were not effectively constrained by wall-clock limits — each call was fast, just expensive.
Cost is the right dimension. v0.15.1 makes it explicit.
How it works
Each `*executor.Task` carries a nullable `*CostBudget`. The five executors (claude/opencode/gemini/codex/pi) call `budget.Add(input, output)` at their natural token-tally event point:
| Executor | Hook event |
|---|---|
| claude | stream-json usage event |
| opencode | stream-json tool_result / step_finish event |
| gemini | post-hoc at result event (no incremental usage from Gemini CLI) |
| codex | turn boundary |
| pi | per-turn message_end event |
When the running cost exceeds `MaxCostUSD`, the executor returns early with `Result.CostKilledAt` set to the running cost and stops reading the stream. This is a distinct failure category from `api_error`/`timeout`/`logic_error` in the dashboard.
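A minimal sketch of what such a budget hook might look like — the `CostBudget`, `Add`, and `MaxCostUSD` names come from this doc; the per-1k pricing fields and the internals are illustrative assumptions:

```go
package main

import "fmt"

// CostBudget tracks running spend against a per-task ceiling.
// MaxCostUSD == 0 disables enforcement (passive token tally only).
type CostBudget struct {
	MaxCostUSD     float64 // per-task cost ceiling in USD
	InputPer1K     float64 // $ per 1k input tokens (illustrative field)
	OutputPer1K    float64 // $ per 1k output tokens (illustrative field)
	RunningCostUSD float64
}

// Add tallies one usage event and reports whether the ceiling is exceeded.
func (b *CostBudget) Add(inputTokens, outputTokens int) (exceeded bool) {
	b.RunningCostUSD += float64(inputTokens)/1000*b.InputPer1K +
		float64(outputTokens)/1000*b.OutputPer1K
	if b.MaxCostUSD == 0 {
		return false // enforcement disabled; tally only
	}
	return b.RunningCostUSD > b.MaxCostUSD
}

func main() {
	b := &CostBudget{MaxCostUSD: 0.30, InputPer1K: 0.0003, OutputPer1K: 0.0012}
	// Executor loop: on each usage event, tally and check the budget.
	if b.Add(200_000, 50_000) { // 200k in + 50k out → $0.06 + $0.06 = $0.12
		fmt.Println("cost_killed at", b.RunningCostUSD)
	} else {
		fmt.Println("within budget:", b.RunningCostUSD) // within budget: 0.12
	}
}
```

Each executor would call `Add` at its hook event from the table above and return early when it reports `true`.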
Per-model budgets: the `budgets:` block
Override the defaults per model in `models.yml`:
```yaml
opencode-or-minimax-m2-7:
  pricing:
    input_per_1k: 0.0003
    output_per_1k: 0.0012
  budgets:
    max_cost_usd: 0.30        # generous cost ceiling — 30¢ for a cheap model
    hard_timeout_secs: 600    # 10 min wall-clock SAFETY NET
    expected_ttft_secs: 30    # for "abnormally slow" alerts (advisory)
    expected_ttf_solution_secs: 90
```
When `budgets:` is omitted, defaults apply:

```
max_cost_usd      = min($0.50, input_per_1k × 64 + output_per_1k × 32)
hard_timeout_secs = 600
```
Examples of resolved defaults:
| Model | Pricing (in/out) | Default max_cost_usd |
|---|---|---|
| or-minimax-m2-7 | $0.30 / $1.20 per 1M | $0.058 (formula) |
| or-glm-5 | $0.60 / $2.08 per 1M | $0.105 (formula) |
| claude-opus-4-7 | $15 / $75 per 1M | $0.50 (clipped to ceiling) |
| ollama-codellama (free) | $0 / $0 | $0 (no enforcement) |
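The default resolution reduces to a one-liner; a sketch (the function name is hypothetical, the formula is the one above):

```go
package main

import (
	"fmt"
	"math"
)

// defaultMaxCostUSD reproduces the documented default:
// min($0.50, input_per_1k × 64 + output_per_1k × 32).
// Free models resolve to $0, which means no enforcement.
func defaultMaxCostUSD(inputPer1K, outputPer1K float64) float64 {
	return math.Min(0.50, inputPer1K*64+outputPer1K*32)
}

func main() {
	fmt.Printf("%.3f\n", defaultMaxCostUSD(0.0003, 0.0012))  // or-minimax-m2-7 → 0.058
	fmt.Printf("%.3f\n", defaultMaxCostUSD(0.0006, 0.00208)) // or-glm-5 → 0.105
	fmt.Printf("%.2f\n", defaultMaxCostUSD(0.015, 0.075))    // claude-opus-4-7 → 0.50 (clipped)
}
```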
Speed metrics in Result
The eval harness now records speed observables alongside cost:
| Field | Meaning |
|---|---|
| FirstAttemptMs | ms from task start to first solution submission (first Write/Edit tool call, or first text fallback) |
| SuccessAtMs | ms from task start to passing solution (-1 if not measured) |
| Turns | agent-mode turn count (already existed, now top-level) |
| TokensPerSec | OutputTokens / generation seconds |
| CostKilledAt | running cost at kill time (0 if not killed) |
These flow into the dashboard's efficiency block per model:
"efficiency": {
"median_time_to_first_attempt_ms": 8400,
"median_time_to_success_ms": 42000,
"median_turns_to_success": 3,
"median_tokens_per_sec": 45.2,
"p90_cost_per_success": 0.18,
"speed_efficiency_score": 0.73,
"cost_killed_count": 0
}
`speed_efficiency_score = success_rate / (1 + median_TTS_seconds / 60)` — a [0..1] score the dashboard uses to rank models on the cost-vs-speed Pareto frontier.
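A worked example of the formula with hypothetical numbers (80% success rate, 42 s median time-to-success):

```go
package main

import "fmt"

// speedEfficiencyScore implements the documented formula:
// success_rate / (1 + median_TTS_seconds / 60).
// A model that succeeds instantly scores its raw success rate;
// each extra minute of median TTS halves-ish the score.
func speedEfficiencyScore(successRate, medianTTSSeconds float64) float64 {
	return successRate / (1 + medianTTSSeconds/60)
}

func main() {
	// Hypothetical model: 80% success rate, 42s median time-to-success.
	fmt.Printf("%.3f\n", speedEfficiencyScore(0.80, 42)) // 0.8 / 1.7 ≈ 0.471
}
```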
Dashboard charts
Two new charts visualize the new metrics:
- Speed Radar — median time-to-success per model, outlier-clipped at 5× median (mirroring the v0.15.0 cost radar pattern)
- Cost-Speed Frontier — Pareto scatter (x = $/success log scale, y = sec/success log scale, marker size = success rate, color = harness). Dominated points highlighted; frontier line drawn through optimal models.
The existing PerModelTrend chart gains a metric dropdown: Success Rate / Time to Success / Cost per Success.
Migration guide
For existing baselines (v0.14.x, v0.15.0):
- Pre-v0.15.1 result JSONs lack the `cost_killed_at`/`first_attempt_ms`/`success_at_ms`/`tokens_per_sec` fields → all default to 0
- `median_time_to_success_ms` falls back to `duration_ms` for these baselines (`ComputeEfficiency` uses `DurationMs` when `SuccessAtMs <= 0`)
- Dashboard Speed Radar shows realistic data for v0.15.1+ baselines; pre-v0.15.1 entries appear with the full `DurationMs` (post-hoc) instead of true time-to-success
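The fallback can be sketched as follows — the `SuccessAtMs`/`DurationMs` names and the `<= 0` check come from this doc; the struct shape and helper name are illustrative:

```go
package main

import "fmt"

// Result mirrors just the two fields the fallback needs; JSON tags
// match the snake_case field names above. Otherwise illustrative.
type Result struct {
	DurationMs  int64 `json:"duration_ms"`
	SuccessAtMs int64 `json:"success_at_ms"` // 0 in pre-v0.15.1 baselines
}

// timeToSuccessMs applies the documented fallback: pre-v0.15.1 results
// carry no SuccessAtMs, so the full DurationMs stands in for it.
func timeToSuccessMs(r Result) int64 {
	if r.SuccessAtMs <= 0 {
		return r.DurationMs
	}
	return r.SuccessAtMs
}

func main() {
	legacy := Result{DurationMs: 42000}                   // pre-v0.15.1 baseline
	current := Result{DurationMs: 42000, SuccessAtMs: 8400} // v0.15.1+
	fmt.Println(timeToSuccessMs(legacy), timeToSuccessMs(current)) // 42000 8400
}
```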
For new evaluations:
- No action needed — the eval harness automatically populates `Task.Budget` from each model's resolved `MaxCostUSD`
- Pass `--agent-parallel N` as before; budgets are enforced per-Task, parallelism unchanged
- Watch for `cost_killed` rows in the Top Error Codes table after large baseline runs
Disabling cost enforcement
To run with legacy wall-clock-only behaviour (e.g., debugging a stuck eval):
```yaml
some-model:
  budgets:
    max_cost_usd: 0         # disable enforcement; passive token tally only
    hard_timeout_secs: 60   # restore the original 60s ceiling
```
`MaxCostUSD == 0` makes `CostBudget.Add()` always return `exceeded=false`. Useful for back-compat testing and replay verification.
See also
- Design doc — full architectural rationale + axiom scoring
- Sprint plan — milestone breakdown
- Model configuration guide — full `models.yml` schema