
OS / Local-model leaderboard

Pass-rate and trend data for the open-source, locally hosted models that AILANG's eval rig runs continuously on dedicated hardware (opencode + local Ollama on a 128 GB Apple-silicon machine).

This is the home for the cross-language and cross-harness comparisons. The cloud leaderboards (Model Leaderboard, ELO) cover AILANG + Python; the multi-language story (JavaScript, Go) and the harness comparison (opencode / claude / codex / pi) live here, because N-trial, multi-language sweeps are expensive on pay-per-token cloud APIs and the local rig runs them at no per-token cost.

Why a separate section

Pay-per-token cloud APIs make N-trial evaluation expensive, so published results are usually "one shot, one model, one benchmark." Running on the local rig lets the project publish, at no per-token cost:

  • N≥3 trials per (model, benchmark) so variance is visible, not hidden behind a single roll.
  • Cross-language: the same benchmark in AILANG, Python, JavaScript, and Go.
  • Cross-harness: the same model through different agentic CLIs.
  • Longitudinal trend deltas across releases: the same benchmark and model, previous release vs. current release.

The numbers feed the language-design feedback loop: a benchmark that fails persistently across N trials becomes a candidate for a stdlib, prompt, or syntax fix, and is then re-measured at the next release.
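As a rough illustration of why multiple trials matter, the sketch below computes a pass rate and spread per (model, benchmark) pair and flags pairs that failed every trial. The model and benchmark names, the in-memory data layout, and the "failed every trial" reading of "fails persistently" are all assumptions made for this example; they are not taken from AILANG's eval rig.

```python
from statistics import mean, pstdev

# Hypothetical trial outcomes: (model, benchmark) -> one pass/fail per trial.
# Names are placeholders, not real leaderboard entries.
trials = {
    ("model-a", "bench-1"): [True, False, True],
    ("model-a", "bench-2"): [False, False, False],
    ("model-b", "bench-1"): [True, True, True],
}


def summarize(outcomes):
    """Return trial count, pass rate, and population std dev for one pair."""
    scores = [1.0 if ok else 0.0 for ok in outcomes]
    return {"n": len(scores), "pass_rate": mean(scores), "spread": pstdev(scores)}


def persistent_failures(trials, min_trials=3):
    """Pairs with at least min_trials runs and zero passes (assumed definition)."""
    return sorted(
        key
        for key, outcomes in trials.items()
        if len(outcomes) >= min_trials and not any(outcomes)
    )


summary = {key: summarize(outcomes) for key, outcomes in trials.items()}
candidates = persistent_failures(trials)

for key, stats in summary.items():
    print(key, stats)
print("fix candidates:", candidates)
```

A single-shot run of ("model-a", "bench-1") could have reported either a pass or a fail; with three trials the nonzero spread makes that instability visible, while ("model-a", "bench-2") stands out as a consistent failure.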

How the data is published

Local-rig rotations are published as static data (no live server) via ailang eval-publish. The command walks a rotation directory for summary.json files (produced by make eval-smoke … --trials N), merges them per (benchmark, model, lang), and emits a per-release snapshot with an optional trend-delta section vs. the previous release:

ailang eval-publish <version> \
  --rotation eval_results/rotation/<date> \
  --prev eval_results/rotation/<prev-date> \
  --prev-tag <prev-version>

See the local-Ollama evaluation guide for how the rotation directories get populated.
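The merge-and-delta step described above can be sketched in a few lines of Python. This is not the implementation behind ailang eval-publish: the summary.json schema used here (a list of records with benchmark, model, lang, passed, and trials fields) and the function names merge_rotation and trend_delta are hypothetical, chosen only to make the idea concrete.

```python
import json
from collections import defaultdict
from pathlib import Path


def merge_rotation(rotation_dir):
    """Walk rotation_dir for summary.json files and return the pass rate
    per (benchmark, model, lang). The record schema is an assumption."""
    totals = defaultdict(lambda: {"passed": 0, "trials": 0})
    for path in sorted(Path(rotation_dir).rglob("summary.json")):
        for rec in json.loads(path.read_text()):
            key = (rec["benchmark"], rec["model"], rec["lang"])
            totals[key]["passed"] += rec["passed"]
            totals[key]["trials"] += rec["trials"]
    # Skip keys with zero recorded trials to avoid dividing by zero.
    return {k: v["passed"] / v["trials"] for k, v in totals.items() if v["trials"]}


def trend_delta(current, previous):
    """Pass-rate change for keys present in both rotations."""
    return {k: current[k] - previous[k] for k in current.keys() & previous.keys()}
```

With two rotation directories on disk, trend_delta(merge_rotation(current_dir), merge_rotation(previous_dir)) yields a positive number where a (benchmark, model, lang) combination improved and a negative one where it regressed. Combinations that appear in only one rotation are left out of the delta rather than being treated as zero.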

Current data

Loading local-rig data…

The historical v0.23.0 snapshots were retired during the benchmark-docs consolidation; the table above is populated from the latest published local-rig rotation (/benchmarks/os/latest.json). Until the next rotation is published, the cloud AILANG-vs-Python leaderboards are available on the Model Leaderboard and ELO Ratings pages.