Model Leaderboard

Real-world performance metrics for AILANG and Python across multiple AI models, harnesses, and languages.

Evaluation Modes

This page shows results from three complementary evaluation approaches:

  • Standard API (0-shot): a direct model API call. Does the model produce correct code on the first attempt? Metric: zeroShotSuccess
  • Self-repair: one additional attempt after failure. Does structured error feedback help? Metric: finalSuccess
  • Agent mode: an agentic CLI (Claude Code / Gemini CLI / opencode / Codex) with multi-turn iteration, reflecting a real-world developer workflow. Metric: agentSuccessRate

Agent mode results are also shown in the Benchmark Explorer broken down by language and harness.

Model Comparison

Compare AI model performance across multiple dimensions:


What These Numbers Mean

Our benchmark suite tests AI models' ability to generate correct, working code across 4 languages.

Success Metrics

  • 0-Shot Success: Code works on first try (no repairs)
  • Final Success: Code works after M-EVAL-LOOP self-repair
  • Agent Success: Code works via multi-turn agentic iteration
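All three metrics are simple pass rates over benchmark runs. A minimal sketch of how they could be computed, assuming a hypothetical per-run record shape (the field names below are illustrative, not the real harness schema):

```python
# Hypothetical per-benchmark run records. Field names (zero_shot_ok,
# final_ok, agent_ok) are illustrative stand-ins, not the real schema.
runs = [
    {"id": "adt_option", "zero_shot_ok": False, "final_ok": True,  "agent_ok": True},
    {"id": "records",    "zero_shot_ok": True,  "final_ok": True,  "agent_ok": True},
    {"id": "effects_io", "zero_shot_ok": False, "final_ok": False, "agent_ok": True},
]

def rate(runs, key):
    """Fraction of runs that passed under the given metric."""
    return sum(r[key] for r in runs) / len(runs)

zero_shot = rate(runs, "zero_shot_ok")  # 0-Shot Success
final     = rate(runs, "final_ok")      # Final Success (after self-repair)
agent     = rate(runs, "agent_ok")      # Agent Success (multi-turn)
```

Note that the three rates are monotone by construction: every 0-shot pass also counts as a final pass, so finalSuccess ≥ zeroShotSuccess for the same run set.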

Why This Matters

These benchmarks demonstrate:

  1. Type Safety Works: AILANG's type system catches errors early
  2. Effects Are Clear: Explicit effect annotations help AI models
  3. Patterns Are Learnable: AI models understand functional programming
  4. Room to Grow: Benchmarks identify language gaps and guide development

How Benchmarks Guide Development

The M-EVAL-LOOP system uses these benchmarks to:

  1. Identify Bugs: Failing benchmarks reveal language issues
  2. Validate Fixes: Compare before/after to confirm improvements
  3. Track Progress: Historical data shows language evolution
  4. Prioritize Features: High-impact failures guide roadmap
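The self-repair stage described above can be sketched as a small loop: try once, and on failure feed a structured error report back into one more generation. This is a simplification with toy stand-ins for the model call and the benchmark runner, not the actual M-EVAL-LOOP implementation:

```python
from dataclasses import dataclass

@dataclass
class Result:
    ok: bool
    error: str = ""

def eval_with_repair(prompt, generate, run, max_repairs=1):
    """One zero-shot attempt, then up to max_repairs retries, feeding the
    structured error report back into the next generation. `generate` and
    `run` are hypothetical stand-ins for the model call and the
    compile-and-execute step."""
    code = generate(prompt, feedback=None)
    result = run(code)
    zero_shot_ok = result.ok
    for _ in range(max_repairs):
        if result.ok:
            break
        code = generate(prompt, feedback=result.error)
        result = run(code)
    return zero_shot_ok, result.ok  # (0-shot success, final success)

# Toy stand-ins: the first attempt has a bug; the repair attempt fixes it.
def fake_generate(prompt, feedback=None):
    return "fixed" if feedback else "buggy"

def fake_run(code):
    return Result(ok=(code == "fixed"), error="runtime_error")

print(eval_with_repair("adt_option", fake_generate, fake_run))  # (False, True)
```

A run that fails zero-shot but passes after repair, like the toy example above, is exactly what separates finalSuccess from zeroShotSuccess in the leaderboard.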

Case Study: Float Equality Bug

The adt_option benchmark caught a critical bug where float comparisons with variables called eq_Int instead of eq_Float. The benchmark suite detected it, guided the fix, and validated the solution.

Result: Benchmark went from runtime_error → PASSING ✅

Try It Yourself

Want to see AILANG in action?


Methodology: Benchmarks use deterministic seeds across multiple AI models. Each benchmark tests code generation, compilation, and execution. The M-EVAL-LOOP system provides structured error feedback for automatic repair.

Learn More: M-EVAL-LOOP Design | Evaluation Guide