Benchmarks Overview
AILANG continuously evaluates how well coding agents synthesize correct AILANG (and, for comparison, other languages). The dashboards split along two axes — pick the one that answers your question:
1. Cloud frontier — AILANG vs Python
The best pay-per-token API models, on AILANG and Python. This is the "how capable is the frontier on AILANG?" view. Numbers are regraded for formatting parity, so a correct answer isn't marked wrong over True vs true or 7.50 vs 7.5.
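As an illustration of what formatting parity can mean in practice, here is a minimal sketch of an output comparison that ignores boolean casing and trailing zeros. It is not the actual AILANG regrader; the names `normalize_token` and `outputs_match` are hypothetical, and the real rules may cover more or fewer cases.

```python
from decimal import Decimal, InvalidOperation


def normalize_token(token: str) -> str:
    """Map equivalent spellings of a value to one canonical form (hypothetical helper)."""
    lowered = token.lower()
    if lowered in ("true", "false"):
        # True vs true: compare booleans case-insensitively.
        return lowered
    try:
        number = Decimal(token)
    except InvalidOperation:
        # Not a number: leave the token untouched.
        return token
    if not number.is_finite():
        return token
    # normalize() drops trailing zeros (7.50 -> 7.5); the "f" format
    # avoids exponent notation for values such as 100.
    return format(number.normalize(), "f")


def outputs_match(expected: str, actual: str) -> bool:
    """Compare two outputs token by token after normalization (assumed grading rule)."""
    expected_tokens = [normalize_token(t) for t in expected.split()]
    actual_tokens = [normalize_token(t) for t in actual.split()]
    return expected_tokens == actual_tokens
```

With this sketch, `outputs_match("7.50", "7.5")` and `outputs_match("True", "true")` both hold, while genuinely different values such as 7.5 and 7.6 still fail.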
- Model Leaderboard — per-model pass rates, AILANG-vs-Python gap, trend over releases.
- ELO Ratings & Difficulty — capability rating per model, difficulty rating per benchmark (derived, not hand-assigned), saturation and grader-artifact flags. Split by mode (standard vs agent).
- Value Score — cost vs quality vs speed.
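One way a benchmark difficulty can be derived rather than hand-assigned is to treat each attempt as a match between a model and a benchmark and apply a standard Elo update. The sketch below assumes that setup; the page does not state how the dashboard actually computes its ratings, and the K-factor of 32 and the function names are illustrative choices.

```python
def expected_pass(model_rating: float, benchmark_rating: float) -> float:
    """Standard Elo expectation that the model 'beats' (passes) the benchmark."""
    return 1.0 / (1.0 + 10 ** ((benchmark_rating - model_rating) / 400.0))


def update_ratings(
    model_rating: float,
    benchmark_rating: float,
    passed: bool,
    k: float = 32.0,  # assumed K-factor, not taken from the dashboards
) -> tuple[float, float]:
    """Return updated (model, benchmark) ratings after one attempt."""
    score = 1.0 if passed else 0.0
    delta = k * (score - expected_pass(model_rating, benchmark_rating))
    # A pass raises the model's capability and lowers the benchmark's
    # difficulty by the same amount; a failure does the opposite.
    return model_rating + delta, benchmark_rating - delta
```

Under this scheme a benchmark that strong models keep failing drifts upward in difficulty, and one that every model passes drifts downward, which is one plausible signal for saturation.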
2. OS / Local models — cross-language, cross-harness, longitudinal
Open-source / locally-hosted models, run continuously on a local rig at zero server cost. This is where the multi-language (JavaScript, Go) and harness (opencode / claude / codex) comparisons live, with N≥3 trials and per-release trends.
- OS / Local Leaderboard — cross-language × harness, longitudinal.
- Agent Harness Explorer — agent-mode cross-harness, by language.
Two rules that keep the numbers honest
- Standard and agent never mix. Standard = 0-shot + self-repair via the API; agent = multi-turn agentic CLI. Every chart is one or the other, labeled — a model only appears on a view for the modes it actually ran.
- Cloud is AILANG+Python; multi-language is an OS/local concern. Running N-trial JS/Go sweeps on cloud APIs is expensive; the local rig does it for free.
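The first rule can be pictured as keying every aggregate on the mode as well as the model, so that standard and agent results are never pooled. This is a minimal sketch under that assumption; the tuple layout and the name `pass_rates_by_mode` are hypothetical and do not describe the real evaluation pipeline.

```python
from collections import defaultdict

VALID_MODES = ("standard", "agent")


def pass_rates_by_mode(runs):
    """Compute pass rates per (model, mode) pair.

    runs: iterable of (model, mode, passed) tuples (assumed layout).
    """
    totals = defaultdict(lambda: [0, 0])  # (model, mode) -> [passes, attempts]
    for model, mode, passed in runs:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        bucket = totals[(model, mode)]
        bucket[0] += int(passed)
        bucket[1] += 1
    # Only (model, mode) pairs that actually ran appear in the result.
    return {key: passes / attempts for key, (passes, attempts) in totals.items()}
```

Because the result only contains pairs that have at least one run, a model that never ran in agent mode simply has no agent entry, rather than a zero.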
Also
- Benchmark Gallery — browse the benchmark suite by tier (prompts, expected output, per-language pass rates).
- Codebase Statistics — AILANG repository growth over time.