Value Score
Cost-quality-speed analysis for the AILANG eval suite. Which model gives you the most pass rate per dollar — or per second?
- Score vs Cost — LMArena-style: where does each model sit on the cost / accuracy plane?
- Score vs Speed — interactive vs batch tradeoff
- Value Score Table — weighted ranking with N=1..4 quality weighting, plus Pareto-frontier flags
Methodology
Each model's value score combines three orthogonal axes:
score = pass_rate^N / (cost_per_success × (1 + median_TTS_seconds / 60))
- `pass_rate` — fraction of benchmarks the model solved correctly (final, with self-repair)
- `cost_per_success` — total dollars spent ÷ number of successful runs
- `median_TTS_seconds` — wall-clock time to a passing solution (post-v0.15.1; older runs fall back to total duration)
- `N` — quality weighting:
  - `N=1` — pure cost efficiency (raw $/success)
  - `N=2` — balanced (default recommendation)
  - `N=3` — quality matters more than savings
  - `N=4` — quality dominates
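As a concrete reference, here is a minimal Python sketch of the scoring function. The model names and numbers are hypothetical, and the `ModelRun` fields simply mirror the definitions above; they are not part of any AILANG API.

```python
from dataclasses import dataclass

@dataclass
class ModelRun:
    name: str
    pass_rate: float           # fraction of benchmarks solved, 0.0 to 1.0
    cost_per_success: float    # total dollars spent / number of successful runs
    median_tts_seconds: float  # wall-clock time to a passing solution

def value_score(run: ModelRun, n: int = 2) -> float:
    """pass_rate^N / (cost_per_success * (1 + median_TTS_seconds / 60))."""
    speed_penalty = 1 + run.median_tts_seconds / 60
    return run.pass_rate ** n / (run.cost_per_success * speed_penalty)

# Hypothetical numbers, for illustration only.
runs = [
    ModelRun("model-a", pass_rate=0.90, cost_per_success=0.50, median_tts_seconds=30),
    ModelRun("model-b", pass_rate=0.60, cost_per_success=0.10, median_tts_seconds=30),
]
for run in sorted(runs, key=lambda r: value_score(r), reverse=True):
    print(f"{run.name}: {value_score(run):.2f}")
```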
The Pareto frontier is the set of models where no other model is both cheaper and higher in pass rate. Picking a model off the frontier means you're paying more for less; models on the frontier represent genuine engineering tradeoffs.
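The frontier flag follows directly from that definition. A minimal sketch, again with hypothetical models; the quadratic scan is fine at eval-suite scale:

```python
from dataclasses import dataclass

@dataclass
class ModelRun:
    name: str
    pass_rate: float
    cost_per_success: float

def pareto_frontier(runs: list[ModelRun]) -> list[ModelRun]:
    """Keep runs that no other run dominates: strictly cheaper AND strictly more accurate."""
    return [
        run for run in runs
        if not any(
            other.cost_per_success < run.cost_per_success
            and other.pass_rate > run.pass_rate
            for other in runs
        )
    ]

# "model-mid" is dominated: "model-cheap" is both cheaper and more accurate.
runs = [
    ModelRun("model-best", pass_rate=0.90, cost_per_success=0.50),
    ModelRun("model-cheap", pass_rate=0.60, cost_per_success=0.10),
    ModelRun("model-mid", pass_rate=0.55, cost_per_success=0.20),
]
print([run.name for run in pareto_frontier(runs)])  # ['model-best', 'model-cheap']
```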
Why this matters for AILANG
AILANG's eval suite measures real-world AI code generation across multiple harnesses. Without a unified score, comparing models requires juggling pass-rate tables, cost spreadsheets, and latency dashboards.
The score formula here is intentionally simple — multiplicative, transparent, and tunable via the N exponent. It rewards models that are simultaneously cheap, accurate, and fast.
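To see the exponent at work, compare two hypothetical models with equal 30-second median TTS, so the denominator is cost × 1.5 in both cases: one at a 90% pass rate and $0.50 per success, the other at 60% and $0.10. At N=2 the cheap model wins (0.6² / 0.15 ≈ 2.40 vs 0.9² / 0.75 ≈ 1.08), but at N=4 the accurate model edges ahead (0.9⁴ / 0.75 ≈ 0.87 vs 0.6⁴ / 0.15 ≈ 0.86).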
For the underlying definitions and methodology behind cost-as-primary-gate eval semantics (introduced in v0.15.1), see the Cost-and-Speed Budgets guide.