Benchmarks Overview

AILANG continuously evaluates how well coding agents synthesize correct AILANG (and, for comparison, other languages). The dashboards fall into two groups; pick the one that answers your question:

1. Cloud frontier — AILANG vs Python

The best pay-per-token API models, on AILANG and Python. This is the "how capable is the frontier on AILANG?" view. Numbers are regraded for formatting parity: a correct answer isn't marked wrong over True vs true or 7.50 vs 7.5.
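To make the formatting-parity idea concrete, here is a minimal sketch of a comparison that ignores boolean casing and trailing zeros. It is not the AILANG grader; the function names `normalize_token` and `outputs_match` and the token-by-token strategy are assumptions made for illustration.

```python
from decimal import Decimal, InvalidOperation


def normalize_token(token: str) -> str:
    """Canonicalize one whitespace-separated token (hypothetical helper)."""
    # Booleans compare case-insensitively: "True" and "true" are the same answer.
    if token.lower() in ("true", "false"):
        return token.lower()
    try:
        # Numbers compare by value: "7.50" and "7.5" normalize to the same string.
        return str(Decimal(token).normalize())
    except InvalidOperation:
        # Anything else is compared verbatim.
        return token


def outputs_match(expected: str, actual: str) -> bool:
    """Return True when two outputs differ only in formatting (assumed rule)."""
    expected_tokens = [normalize_token(t) for t in expected.split()]
    actual_tokens = [normalize_token(t) for t in actual.split()]
    return expected_tokens == actual_tokens
```

Under these assumptions, `outputs_match("7.50", "7.5")` and `outputs_match("True", "true")` both hold, while a genuinely different value such as `7.6` still fails.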

  • Model Leaderboard — per-model pass rates, AILANG-vs-Python gap, trend over releases.
  • ELO Ratings & Difficulty — capability rating per model and difficulty rating per benchmark (derived, not hand-assigned), plus saturation and grader-artifact flags. Split by mode (standard vs agent).
  • Value Score — cost vs quality vs speed.
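One way a benchmark's difficulty can be derived rather than hand-assigned is to treat every attempt as a match between a model and a benchmark, using the standard Elo update. The sketch below illustrates that idea only; the page does not state the formula the dashboard uses, and the K-factor of 32, the starting ratings, and the function names are assumptions.

```python
def expected_score(model_rating: float, benchmark_rating: float) -> float:
    """Standard Elo expectation that the model 'beats' (passes) the benchmark."""
    return 1.0 / (1.0 + 10 ** ((benchmark_rating - model_rating) / 400.0))


def update_ratings(
    model_rating: float,
    benchmark_rating: float,
    passed: bool,
    k: float = 32.0,  # assumed K-factor, not taken from the source
) -> tuple[float, float]:
    """Return the updated (model_rating, benchmark_rating) after one attempt."""
    expected = expected_score(model_rating, benchmark_rating)
    actual = 1.0 if passed else 0.0
    delta = k * (actual - expected)
    # A pass raises the model's capability and lowers the benchmark's difficulty;
    # a failure does the opposite. The two ratings move by equal, opposite amounts.
    return model_rating + delta, benchmark_rating - delta
```

With equal starting ratings, a pass moves the model up by half the K-factor and the benchmark down by the same amount. Because standard and agent runs are kept apart, a scheme like this would keep a separate rating table for each mode.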

2. OS / Local models — cross-language, cross-harness, longitudinal

Open-source / locally-hosted models, run continuously on a local rig at zero server cost. This is where the multi-language (JavaScript, Go) and harness (opencode / claude / codex) comparisons live, with N≥3 trials and per-release trends.

Two rules that keep the numbers honest

  • Standard and agent never mix. Standard = 0-shot + self-repair via the API; agent = multi-turn agentic CLI. Every chart is labeled as one or the other, and a model appears on a view only for the modes it actually ran.
  • Cloud is AILANG+Python; multi-language is an OS/local concern. Running N-trial JS/Go sweeps on cloud APIs is expensive; the local rig does it for free.
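The "never mix" rule can be sketched as an aggregation that keys every pass rate by (model, mode), so a standard run can never be averaged with an agent run. The record layout, the names `pass_rates` and `MIN_TRIALS`, and applying the N≥3 trial threshold at this step are all assumptions for illustration, not the dashboard's actual pipeline.

```python
from collections import defaultdict

VALID_MODES = ("standard", "agent")
MIN_TRIALS = 3  # assumed threshold, mirroring the N>=3 trials mentioned for local runs


def pass_rates(runs):
    """Compute pass rates per (model, mode) from (model, mode, passed) tuples.

    Modes are never pooled, and a model gets an entry only for the modes
    it actually ran with at least MIN_TRIALS trials.
    """
    tallies = defaultdict(lambda: [0, 0])  # (model, mode) -> [passes, trials]
    for model, mode, passed in runs:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        tally = tallies[(model, mode)]
        tally[0] += int(passed)
        tally[1] += 1
    return {
        key: passes / trials
        for key, (passes, trials) in tallies.items()
        if trials >= MIN_TRIALS
    }
```

A model with three standard trials and a single agent trial would therefore appear only on the standard view.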

Also