# AI Evaluation Framework (M-EVAL-LOOP v2.0)
This directory contains documentation for AILANG's AI evaluation framework, which measures how well AI models can generate AILANG code and provides automated feedback loops for continuous improvement.
## Overview

M-EVAL-LOOP empirically measures the "AI teachability" of AILANG, one of the project's key success metrics. It:
- Compares AI code generation across AILANG vs Python
- Tracks performance across multiple models and benchmarks
- Provides automated analysis and fix suggestions
- Validates fixes and measures improvements
## Quick Start

### Cost-Conscious Development (Recommended)

Use cheaper, faster models for daily development:

```bash
# Quick dev check (gpt5-mini, gemini-2-5-flash)
ailang eval-suite

# Create baseline
make eval-baseline

# Analyze failures
ailang eval-analyze -results eval_results/baselines/v0.3.0 -dry-run
```

Cost: ~$0.0003-0.002 per benchmark (5-10x cheaper than the full suite).
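As a rough sanity check on that range: assuming a benchmark of about 1,000 prompt tokens and 500 completion tokens on GPT-5 Mini ($0.25/$2 per 1M tokens), the cost works out to 0.001 × $0.25 + 0.0005 × $2 ≈ $0.00125 per run, squarely inside the quoted band.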
### Full Comprehensive Testing

Use the expensive models for release validation:

```bash
# Full suite (gpt5, claude-sonnet-4-5, gemini-2-5-pro)
ailang eval-suite --full

# Full baseline
make eval-baseline FULL=true

# Compare results
ailang eval-compare baselines/v0.3.0 current
```

Cost: ~$0.003-0.015 per benchmark.
### Custom Model Selection

```bash
# Test specific models
ailang eval-suite --models gpt5,claude-sonnet-4-5

# With self-repair
ailang eval-suite --models gpt5 --self-repair
```
## Documentation

### Core Guides
- architecture.md - System architecture and command reference
- eval-loop.md - Automated evaluation and improvement workflow
- model-configuration.md - Model setup and pricing
- cost-and-speed-budgets.md - Cost-as-primary-gate eval semantics (v0.15.1+)
### Implementation Details
- go-implementation.md - Native Go implementation guide
- migration-guide.md - Migration from bash to Go
- baseline-tests.md - Running baseline tests
## Available Models (October 2025)

### Production Models (`--full`)
- Claude Sonnet 4.5 (Anthropic) - $3/$15 per 1M tokens
- GPT-5 (OpenAI) - $1.25/$10 per 1M tokens
- Gemini 2.5 Pro (Google) - $1.25/$10 per 1M tokens
### Development Models (default)
- GPT-5 Mini (OpenAI) - $0.25/$2 per 1M tokens (~1/5 the price of GPT-5)
- Gemini 2.5 Flash (Google) - $0.30/$2.50 per 1M tokens (~1/4 the price of Gemini 2.5 Pro)
See model-configuration.md for setup details.
## Prerequisites

Set up at least one model's API key:

```bash
# Anthropic Claude (recommended for coding)
export ANTHROPIC_API_KEY="sk-ant-..."

# OpenAI GPT
export OPENAI_API_KEY="sk-proj-..."

# Google Gemini (Application Default Credentials)
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

# OR set an API key
export GOOGLE_API_KEY="..."
```
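To confirm which provider credentials are currently exported, an illustrative shell check (not part of the toolchain itself):

```bash
# Print the names (not the values) of any configured provider keys
env | grep -oE '^(ANTHROPIC_API_KEY|OPENAI_API_KEY|GOOGLE_API_KEY)'
```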
### Python runtime (for cross-language benchmarks)

Python-targeted benchmarks run against a pinned Python version managed by `uv`. This means every machine (laptop, CI runner, contributor fork) executes the exact same interpreter, and the prompt can truthfully advertise the target version to the model, avoiding "the model wrote 3.10 match/case but the grader runs 3.9" failures.
- Pinned version: Python 3.12 (defined by `PinnedPythonVersion` in `internal/eval_harness/python.go`).
- Invocation: the harness spawns `uv run --python 3.12 solution.py`. `uv` downloads and caches the pinned CPython on first use; no system Python required.
- Missing `uv` fails loudly. If `uv` is not on PATH the harness returns a clear error for every Python benchmark rather than silently falling back to a system `python3`. Override for local debugging with `AILANG_UV=/path/to/uv`.
- CI: both `.github/workflows/ci.yml` and `.github/workflows/eval-weekly.yml` use `astral-sh/setup-uv@v4`.
- Local setup: install `uv` once. On macOS/Linux: `curl -LsSf https://astral.sh/uv/install.sh | sh`. The first Python benchmark on a fresh machine pays a ~2s CPython download; everything after that is cached.
The pinned version is also injected into the Python teaching prompt (`prompts/python.md`) and the task template via the `{{PYTHON_VERSION}}` placeholder, so the model sees the same version string the grader will execute.
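To sanity-check the pinned toolchain locally, an illustrative check built from the details above (not a harness command):

```bash
# Resolve the pinned CPython through uv, exactly as the harness invokes it
uv run --python 3.12 python -c 'import sys; print(sys.version)'

# Optional: point the harness at a specific uv binary while debugging
export AILANG_UV=/path/to/uv
```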
## Commands Overview

### Standard Eval (0-shot API)
| Command | Purpose | Example |
|---|---|---|
| `ailang eval-suite` | Run benchmarks | `ailang eval-suite --full` |
| `ailang eval-compare` | Compare two runs | `ailang eval-compare baseline current` |
| `ailang eval-analyze` | Analyze failures | `ailang eval-analyze -results dir -dry-run` |
| `ailang eval-matrix` | Generate performance matrix | `ailang eval-matrix results/ v0.3.0` |
| `ailang eval-summary` | Export to JSONL | `ailang eval-summary results/` |
| `ailang eval-report` | Generate reports | `ailang eval-report results/ v0.3.0` |
### Agent Eval (agentic coding via CLI, v0.8.0+)

| Command | Purpose | Example |
|---|---|---|
| `ailang eval-suite --agent` | Run agent benchmarks | `ailang eval-suite --agent --models claude-haiku-4-5` |
| `ailang eval-chains list` | List eval chains | `ailang eval-chains list` |
| `ailang eval-chains view` | View chain results | `ailang eval-chains view <id>` |
| `ailang eval-chains stats` | Pass rate breakdown | `ailang eval-chains stats <id>` |
| `ailang eval-chains failures` | Show failures only | `ailang eval-chains failures <id>` |
| `ailang eval-report --from-chain` | Report from chain | `ailang eval-report --from-chain <id> v0.8.0` |
| `ailang eval-compare --chain` | Compare two chains | `ailang eval-compare --chain <id1> --chain <id2>` |
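Putting the table together, a typical agent-eval session might look like this (`<id>` is the chain id reported by `eval-chains list`):

```bash
ailang eval-suite --agent --models claude-haiku-4-5
ailang eval-chains list              # find the id of the new chain
ailang eval-chains failures <id>     # inspect only the failing benchmarks
ailang eval-report --from-chain <id> v0.8.0
```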
See architecture.md for detailed command reference.
## Benchmarks

The benchmark suite is organized into four tiers that serve different release-review purposes. Every benchmark YAML declares a tier and 1–3 tags, as sketched below. See benchmarks/CURATION.md for the full curation guide, promotion/demotion criteria, and rotation rules.
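A minimal sketch of what such a manifest might declare; only the `tier` and `tags` fields are documented here, the other fields are illustrative:

```yaml
# benchmarks/example.yml (hypothetical)
id: example_benchmark
tier: core                       # one of: smoke, core, stretch, vision
tags: [recursion, string_algo]   # 1-3 canonical tags (listed below)
```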
### Tier structure (v0.14.0+)
| Tier | Expected pass rate | Run cost | Role |
|---|---|---|---|
| `smoke` | ≥ 95% | seconds | Regression gate — flat near 100% |
| `core` | 70–95% | minutes | Headline metric — what releases quote |
| `stretch` | 30–70% | minutes | Headroom / differentiation benchmarks |
| `vision` | 0–50% | variable | Aspirational — measures potential |
The `core` tier pass rate is the primary headline metric on the dashboard and the number quoted in release notes. The dashboard at ailang.sunholo.com/benchmarks now renders a tier toggle; per-tier aggregates live under `tiers.<name>` in `docs/static/benchmarks/latest.json`.
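For example, one way to pull a single tier's aggregate out of that file (assumes `jq` is installed):

```bash
jq '.tiers.core' docs/static/benchmarks/latest.json
```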
### Running tier subsets

```bash
# Fast regression check (run before every commit)
make eval-smoke    # smoke tier only

# Release gate (run before every tag)
make eval-core     # core tier only — this produces the headline metric

# Differentiation check
make eval-stretch  # stretch tier only

# Arbitrary tier combinations via the --tier flag
ailang eval-suite --tier smoke,core --models claude-haiku-4-5
```
### Tags

Tags describe the AILANG feature surface a benchmark exercises. Use them to localize regressions:

```bash
ailang eval-matrix <baseline_dir> <version> --by-tags
```
Canonical tags: `adt_pattern_match`, `algorithmic`, `contracts`, `data_transform`, `effects_io`, `error_handling`, `functional`, `records`, `recursion`, `state_machine`, `string_algo`, `type_safety`. New tags require a CURATION.md update so the taxonomy stays stable.
### Rotation tools

- `ailang eval-matrix <dir> <ver> --show-saturated` — benchmarks at 100% across all models × all languages (retirement candidates).
- `ailang eval-matrix <dir> <ver> --ailang-wins` — `(benchmark, model)` cells where AILANG passes and Python fails (differentiation signal).
- `.claude/skills/eval-analyzer/scripts/benchmark_health.sh` — one-shot rotation report combining saturation, refusal detection, and tier promotion signals.
## Results Location

After running benchmarks:

### Standard Eval (file-based)

- JSON: `eval_results/baselines/VERSION/*.json` - Full details per run
- Matrix: `eval_results/baselines/VERSION/matrix.json` - Aggregated stats
- Dashboard: `docs/docs/benchmarks/performance.md` - Live leaderboard