# Model Leaderboard
Real-world performance metrics for AILANG and Python across multiple AI models, harnesses, and languages.
- Benchmark Explorer — filter by language, harness, and model; cross-harness comparison table
- Benchmark Gallery — browse all benchmark tasks with pass rates and code samples
- Value Score — cost vs. quality vs. speed analysis with Pareto frontier
## Evaluation Modes
This page shows results from three complementary evaluation approaches:
| Mode | What it tests | Metric |
|---|---|---|
| Standard API (0-shot) | Direct model API call — does the model produce correct code on the first attempt? | `zeroShotSuccess` |
| Self-repair | One additional attempt after a failure — does error feedback help? | `finalSuccess` |
| Agent mode | Agentic CLI (Claude Code / Gemini CLI / opencode / Codex) with multi-turn iteration — the real-world developer workflow | `agentSuccessRate` |
Agent mode results are also shown in the Benchmark Explorer broken down by language and harness.
## Model Comparison
Compare AI model performance across multiple dimensions:
## What These Numbers Mean
Our benchmark suite tests AI models' ability to generate correct, working code across four languages.
### Success Metrics
- 0-Shot Success: Code works on first try (no repairs)
- Final Success: Code works after M-EVAL-LOOP self-repair
- Agent Success: Code works via multi-turn agentic iteration
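The three metrics above can be sketched as simple aggregates over per-run records. This is an illustrative sketch only: the `RunResult` fields are hypothetical, not the actual M-EVAL-LOOP schema.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    zero_shot_ok: bool   # passed on the first attempt
    final_ok: bool       # passed after one self-repair attempt
    agent_ok: bool       # passed via multi-turn agentic iteration

def success_rates(runs: list[RunResult]) -> dict[str, float]:
    """Compute the fraction of runs that succeeded under each mode."""
    n = len(runs)
    return {
        "zeroShotSuccess":  sum(r.zero_shot_ok for r in runs) / n,
        "finalSuccess":     sum(r.final_ok for r in runs) / n,
        "agentSuccessRate": sum(r.agent_ok for r in runs) / n,
    }

# Example: 4 runs where repair and agent iteration each recover some failures.
runs = [
    RunResult(True,  True,  True),
    RunResult(False, True,  True),
    RunResult(False, False, True),
    RunResult(False, False, False),
]
print(success_rates(runs))
# → {'zeroShotSuccess': 0.25, 'finalSuccess': 0.5, 'agentSuccessRate': 0.75}
```

Note that `finalSuccess` can only be greater than or equal to `zeroShotSuccess`, since self-repair adds an attempt after a first-try failure.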
### Why This Matters
These benchmarks demonstrate:
- Type Safety Works: AILANG's type system catches errors early
- Effects Are Clear: Explicit effect annotations help AI models
- Patterns Are Learnable: AI models understand functional programming
- Room to Grow: Benchmarks identify language gaps and guide development
## How Benchmarks Guide Development
The M-EVAL-LOOP system uses these benchmarks to:
- Identify Bugs: Failing benchmarks reveal language issues
- Validate Fixes: Compare before/after to confirm improvements
- Track Progress: Historical data shows language evolution
- Prioritize Features: High-impact failures guide roadmap
### Case Study: Float Equality Bug
The `adt_option` benchmark caught a critical bug in which float comparisons involving variables dispatched to `eq_Int` instead of `eq_Float`. The benchmark suite detected the regression, guided the fix, and validated the solution.
Result: the benchmark went from `runtime_error` to PASSING ✅
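To make the bug class concrete, here is a hedged Python illustration (not AILANG's actual implementation): equality is resolved via per-type dictionaries, and a defaulting rule that ignores the inferred type picks the wrong one. The names `resolve_eq` and `inferred_type` are invented for this sketch.

```python
def eq_Int(a: int, b: int) -> bool:
    return a == b

def eq_Float(a: float, b: float) -> bool:
    # Tolerance-based float equality; exact `==` is unreliable for floats.
    return abs(a - b) < 1e-9

def resolve_eq(inferred_type: str):
    # The buggy version effectively defaulted to eq_Int regardless of the
    # inferred type; the fix dispatches on what the checker actually inferred.
    return {"Int": eq_Int, "Float": eq_Float}[inferred_type]

# The adt_option case: a comparison at type Float must not pick eq_Int.
assert resolve_eq("Float") is eq_Float
assert eq_Float(0.1 + 0.2, 0.3)   # True, despite floating-point rounding
```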
## Try It Yourself
Want to see AILANG in action?
- Interactive REPL - Try AILANG in your browser
- Code Examples - 48+ working examples
- Getting Started - Install and run locally
Methodology: Benchmarks use deterministic seeds across multiple AI models. Each benchmark tests code generation, compilation, and execution. The M-EVAL-LOOP system provides structured error feedback for automatic repair.
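The generate → compile → execute → repair loop from the methodology note can be sketched as follows. `model_generate` and `compile_and_run` are placeholder callables, and the error-feedback prompt format is an assumption, not the real M-EVAL-LOOP API.

```python
def run_benchmark(task: str, model_generate, compile_and_run, seed: int = 42):
    """One benchmark run: first attempt, then at most one self-repair.

    model_generate(prompt, seed=...) -> source code string
    compile_and_run(code) -> (ok: bool, error_message: str)
    """
    code = model_generate(task, seed=seed)     # deterministic seed per run
    ok, err = compile_and_run(code)
    zero_shot = ok
    if not ok:
        # Structured error feedback drives a single repair attempt.
        repair_prompt = f"{task}\nPrevious attempt failed:\n{err}"
        code = model_generate(repair_prompt, seed=seed)
        ok, err = compile_and_run(code)
    return {"zeroShotSuccess": zero_shot, "finalSuccess": ok}
```

With stub callables that fail on the first attempt and succeed after feedback, this returns `{"zeroShotSuccess": False, "finalSuccess": True}` — the case where self-repair recovers a 0-shot failure.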
Learn More: M-EVAL-LOOP Design | Evaluation Guide