
OS-model smoke leaderboard

Per-release pass-rate matrices for the open-source / locally-hosted Ollama models that AILANG's eval rig runs continuously on dedicated hardware.

Why this section exists

Pay-per-token cloud APIs make N-trial evaluation expensive, so published results are usually "one shot, one model, one benchmark." Running opencode against local Ollama on a 128 GB Apple-silicon rig lets the AILANG project publish:

  • N≥3 trials per (model, benchmark) so variance is visible, not hidden behind a single roll.
  • Trend deltas across releases — same benchmark, same model, last release vs. this release.
  • Cross-model comparison at fixed snapshot points — gemma4-26b vs. qwen3-coder-30b vs. devstral on the same prompts and rubric.
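To illustrate why repeated trials matter, here is a minimal Python sketch that groups trial outcomes by (model, benchmark) and reports passes, trial count, and pass rate. The record layout, the model and benchmark names, and the `is_flaky` helper are hypothetical and chosen only for illustration; they are not the eval rig's actual data format.

```python
from collections import defaultdict

# Hypothetical trial records: (model, benchmark, passed).
# These names and outcomes are made up for the example.
TRIALS = [
    ("model-a", "bench-1", True),
    ("model-a", "bench-1", False),
    ("model-a", "bench-1", True),
    ("model-b", "bench-1", True),
    ("model-b", "bench-1", True),
    ("model-b", "bench-1", True),
]


def pass_rates(trials):
    """Group trials by (model, benchmark); return {key: (passes, n, rate)}."""
    grouped = defaultdict(list)
    for model, benchmark, passed in trials:
        grouped[(model, benchmark)].append(bool(passed))
    summary = {}
    for key, results in grouped.items():
        n = len(results)
        passes = sum(results)
        summary[key] = (passes, n, passes / n)
    return summary


def is_flaky(passes, n):
    """A cell is flaky when its trials disagree (neither all pass nor all fail)."""
    return 0 < passes < n


RATES = pass_rates(TRIALS)
for (model, benchmark), (passes, n, rate) in sorted(RATES.items()):
    flag = " (flaky)" if is_flaky(passes, n) else ""
    print(f"{model} / {benchmark}: {passes}/{n} = {rate:.0%}{flag}")
```

In this made-up data, a single trial could have shown both models passing; three trials show that one of them passes only two times out of three.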

The numbers feed the language-design feedback loop: a benchmark that fails persistently across N trials becomes a candidate for a stdlib/prompt/syntax fix, which is then re-measured against the same (model, benchmark) at the next release.

How a release page is generated

Each release page is auto-built by ailang eval-publish:

    ailang eval-publish v0.23.0 \
      --rotation eval_results/rotation/2026-05-23 \
      --prev eval_results/rotation/2026-04-15 \
      --prev-tag v0.22.0

The command walks the rotation directory for summary.json files (one per rotation slot, each produced by make eval-smoke ... --trials N), merges the results per (benchmark, model, lang), and emits the per-release page, with an optional trend-delta section comparing against the previous release.
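The merge-and-compare step can be pictured with a rough Python sketch. This is not the eval-publish implementation: the summary.json schema used here (a top-level "results" list whose entries carry "benchmark", "model", "lang", "passes", and "trials") is an assumption, as are the function names and the choice to merge slots by summing passes and trials.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_rotation(rotation_dir):
    """Merge every summary.json under rotation_dir.

    Returns {(benchmark, model, lang): [passes, trials]}.
    The "results" schema is assumed, not taken from the real tool.
    """
    merged = defaultdict(lambda: [0, 0])
    for path in sorted(Path(rotation_dir).rglob("summary.json")):
        for rec in json.loads(path.read_text())["results"]:
            key = (rec["benchmark"], rec["model"], rec["lang"])
            merged[key][0] += rec["passes"]
            merged[key][1] += rec["trials"]
    return dict(merged)


def trend_deltas(current, previous):
    """Pass-rate change in percentage points for keys present in both rotations."""
    deltas = {}
    for key, (passes, trials) in current.items():
        if key not in previous:
            continue  # no baseline for this cell in the previous release
        prev_passes, prev_trials = previous[key]
        if trials == 0 or prev_trials == 0:
            continue  # avoid dividing by zero on empty cells
        deltas[key] = 100.0 * (passes / trials - prev_passes / prev_trials)
    return deltas
```

Under these assumptions, calling `trend_deltas(load_rotation(current_dir), load_rotation(previous_dir))` gives one signed number per (benchmark, model, lang) cell that appears in both rotations; cells that exist in only one rotation are skipped rather than treated as regressions.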

See the evaluation guide for how the rotation directories get populated.

Releases

Browse the per-release leaderboards from the sidebar to see how each AILANG release performs across the open-model rotation.