Model Leaderboard

Real-world performance metrics for AILANG and Python across multiple AI models, harnesses, and languages.

Evaluation Modes

This page shows results from three complementary evaluation approaches:

  • Standard API (0-shot): a direct model API call. Does the model produce correct code on the first attempt? Metric: zeroShotSuccess
  • Self-repair: one additional attempt after failure. Does structured error feedback help? Metric: finalSuccess
  • Agent mode: an agentic CLI (Claude Code / Gemini CLI / opencode / Codex) with multi-turn iteration, reflecting a real-world developer workflow. Metric: agentSuccessRate

Agent mode results are also shown in the Benchmark Explorer broken down by language and harness.

Model Comparison

Compare AI model performance across multiple dimensions:


What These Numbers Mean

Our benchmark suite tests AI models' ability to generate correct, working code across 4 languages.

Success Metrics

  • 0-Shot Success: Code works on first try (no repairs)
  • Final Success: Code works after M-EVAL-LOOP self-repair
  • Agent Success: Code works via multi-turn agentic iteration
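All three metrics are simple pass rates over benchmark runs. A minimal sketch of how they could be computed, assuming a hypothetical per-run record shape (the field names below are illustrative, not the real harness schema):

```python
# Hypothetical per-benchmark run records. Field names (zero_shot_ok,
# final_ok, agent_ok) are illustrative stand-ins, not the real schema.
runs = [
    {"id": "adt_option", "zero_shot_ok": False, "final_ok": True,  "agent_ok": True},
    {"id": "records",    "zero_shot_ok": True,  "final_ok": True,  "agent_ok": True},
    {"id": "effects_io", "zero_shot_ok": False, "final_ok": False, "agent_ok": True},
]

def rate(runs, key):
    """Fraction of runs that passed under the given metric."""
    return sum(r[key] for r in runs) / len(runs)

zero_shot = rate(runs, "zero_shot_ok")  # 0-Shot Success
final     = rate(runs, "final_ok")      # Final Success (after self-repair)
agent     = rate(runs, "agent_ok")      # Agent Success (multi-turn)
```

Note that the three rates are monotone by construction: every 0-shot pass also counts as a final pass, so finalSuccess ≥ zeroShotSuccess for the same run set.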

Why This Matters

These benchmarks demonstrate:

  1. Type Safety Works: AILANG's type system catches errors early
  2. Effects Are Clear: Explicit effect annotations help AI models
  3. Patterns Are Learnable: AI models understand functional programming
  4. Room to Grow: Benchmarks identify language gaps and guide development

How Benchmarks Guide Development

The M-EVAL-LOOP system uses these benchmarks to:

  1. Identify Bugs: Failing benchmarks reveal language issues
  2. Validate Fixes: Compare before/after to confirm improvements
  3. Track Progress: Historical data shows language evolution
  4. Prioritize Features: High-impact failures guide roadmap
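The self-repair stage described above can be sketched as a small loop: try once, and on failure feed a structured error report back into one more generation. This is a simplification with toy stand-ins for the model call and the benchmark runner, not the actual M-EVAL-LOOP implementation:

```python
from dataclasses import dataclass

@dataclass
class Result:
    ok: bool
    error: str = ""

def eval_with_repair(prompt, generate, run, max_repairs=1):
    """One zero-shot attempt, then up to max_repairs retries, feeding the
    structured error report back into the next generation. `generate` and
    `run` are hypothetical stand-ins for the model call and the
    compile-and-execute step."""
    code = generate(prompt, feedback=None)
    result = run(code)
    zero_shot_ok = result.ok
    for _ in range(max_repairs):
        if result.ok:
            break
        code = generate(prompt, feedback=result.error)
        result = run(code)
    return zero_shot_ok, result.ok  # (0-shot success, final success)

# Toy stand-ins: the first attempt has a bug; the repair attempt fixes it.
def fake_generate(prompt, feedback=None):
    return "fixed" if feedback else "buggy"

def fake_run(code):
    return Result(ok=(code == "fixed"), error="runtime_error")

print(eval_with_repair("adt_option", fake_generate, fake_run))  # (False, True)
```

A run that fails zero-shot but passes after repair, like the toy example above, is exactly what separates finalSuccess from zeroShotSuccess in the leaderboard.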

Case Study: Float Equality Bug

The adt_option benchmark caught a critical bug where float comparisons with variables called eq_Int instead of eq_Float. The benchmark suite detected it, guided the fix, and validated the solution.

Result: Benchmark went from runtime_error → PASSING ✅

Try It Yourself

Want to see AILANG in action?


Methodology: Benchmarks use deterministic seeds across multiple AI models. Each benchmark tests code generation, compilation, and execution. The M-EVAL-LOOP system provides structured error feedback for automatic repair.

Learn More: M-EVAL-LOOP Design | Evaluation Guide