# AI Evaluation Framework (M-EVAL-LOOP v2.0)
This directory contains documentation for AILANG's AI evaluation framework, which measures how well AI models can generate AILANG code and provides automated feedback loops for continuous improvement.
## Overview

M-EVAL-LOOP empirically measures the "AI teachability" of AILANG, one of the project's key success metrics. It:
- Compares AI code generation across AILANG vs Python
- Tracks performance across multiple models and benchmarks
- Provides automated analysis and fix suggestions
- Validates fixes and measures improvements
## Quick Start

### Cost-Conscious Development (Recommended)

Use cheaper, faster models for daily development:

```bash
# Quick dev check (gpt5-mini, gemini-2-5-flash)
ailang eval-suite

# Create baseline
make eval-baseline

# Analyze failures
ailang eval-analyze -results eval_results/baselines/v0.3.0 -dry-run
```

Cost: ~$0.0003-0.002 per benchmark (5-10x cheaper than the full suite).
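As a rough sanity check on that range: assuming a benchmark of about 1,000 prompt tokens and 500 completion tokens on GPT-5 Mini ($0.25/$2 per 1M tokens), the cost works out to 0.001 × $0.25 + 0.0005 × $2 ≈ $0.00125 per run, squarely inside the quoted band.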
### Full Comprehensive Testing

Use the expensive models for release validation:

```bash
# Full suite (gpt5, claude-sonnet-4-5, gemini-2-5-pro)
ailang eval-suite --full

# Full baseline
make eval-baseline FULL=true

# Compare results
ailang eval-compare baselines/v0.3.0 current
```

Cost: ~$0.003-0.015 per benchmark.
### Custom Model Selection

```bash
# Test specific models
ailang eval-suite --models gpt5,claude-sonnet-4-5

# With self-repair
ailang eval-suite --models gpt5 --self-repair
```
## Documentation

### Core Guides
- architecture.md - System architecture and command reference
- eval-loop.md - Automated evaluation and improvement workflow
- model-configuration.md - Model setup and pricing
- cost-and-speed-budgets.md - Cost-as-primary-gate eval semantics (v0.15.1+)
### Implementation Details
- go-implementation.md - Native Go implementation guide
- migration-guide.md - Migration from bash to Go
- baseline-tests.md - Running baseline tests
## Available Models (October 2025)

### Production Models (`--full`)
- Claude Sonnet 4.5 (Anthropic) - $3/$15 per 1M tokens
- GPT-5 (OpenAI) - $1.25/$10 per 1M tokens
- Gemini 2.5 Pro (Google) - $1.25/$10 per 1M tokens
### Development Models (default)
- GPT-5 Mini (OpenAI) - $0.25/$2 per 1M tokens (~1/5 the price of GPT-5)
- Gemini 2.5 Flash (Google) - $0.30/$2.50 per 1M tokens (~1/4 the price of Gemini 2.5 Pro)
See model-configuration.md for setup details.
## Prerequisites

Set up at least one model's API key:

```bash
# Anthropic Claude (recommended for coding)
export ANTHROPIC_API_KEY="sk-ant-..."

# OpenAI GPT
export OPENAI_API_KEY="sk-proj-..."

# Google Gemini (Application Default Credentials)
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

# OR set an API key
export GOOGLE_API_KEY="..."
```
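To confirm which provider credentials are currently exported, an illustrative shell check (not part of the toolchain itself):

```bash
# Print the names (not the values) of any configured provider keys
env | grep -oE '^(ANTHROPIC_API_KEY|OPENAI_API_KEY|GOOGLE_API_KEY)'
```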
### Python runtime (for cross-language benchmarks)

Python-targeted benchmarks run against a pinned Python version managed by `uv`. This means every machine (laptop, CI runner, contributor fork) executes the exact same interpreter, and the prompt can truthfully advertise the target version to the model, avoiding "the model wrote 3.10 match/case but the grader runs 3.9" failures.
- Pinned version: Python 3.12 (defined by `PinnedPythonVersion` in `internal/eval_harness/python.go`).
- Invocation: the harness spawns `uv run --python 3.12 solution.py`. `uv` downloads and caches the pinned CPython on first use; no system Python required.
- Missing `uv` fails loudly. If `uv` is not on PATH the harness returns a clear error for every Python benchmark rather than silently falling back to a system `python3`. Override for local debugging with `AILANG_UV=/path/to/uv`.
- CI: both `.github/workflows/ci.yml` and `.github/workflows/eval-weekly.yml` use `astral-sh/setup-uv@v4`.
- Local setup: install `uv` once. On macOS/Linux: `curl -LsSf https://astral.sh/uv/install.sh | sh`. The first Python benchmark on a fresh machine pays a ~2s CPython download; everything after that is cached.
The pinned version is also injected into the Python teaching prompt (`prompts/python.md`) and the task template via the `{{PYTHON_VERSION}}` placeholder, so the model sees the same version string the grader will execute.
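To sanity-check the pinned toolchain locally, an illustrative check built from the details above (not a harness command):

```bash
# Resolve the pinned CPython through uv, exactly as the harness invokes it
uv run --python 3.12 python -c 'import sys; print(sys.version)'

# Optional: point the harness at a specific uv binary while debugging
export AILANG_UV=/path/to/uv
```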
## Commands Overview

### Standard Eval (0-shot API)
| Command | Purpose | Example |
|---|---|---|
| `ailang eval-suite` | Run benchmarks | `ailang eval-suite --full` |
| `ailang eval-compare` | Compare two runs | `ailang eval-compare baseline current` |
| `ailang eval-analyze` | Analyze failures | `ailang eval-analyze -results dir -dry-run` |
| `ailang eval-matrix` | Generate performance matrix | `ailang eval-matrix results/ v0.3.0` |
| `ailang eval-summary` | Export to JSONL | `ailang eval-summary results/` |
| `ailang eval-report` | Generate reports | `ailang eval-report results/ v0.3.0` |
### Agent Eval (agentic coding via CLI, v0.8.0+)

| Command | Purpose | Example |
|---|---|---|
| `ailang eval-suite --agent` | Run agent benchmarks | `ailang eval-suite --agent --models claude-haiku-4-5` |
| `ailang eval-chains list` | List eval chains | `ailang eval-chains list` |
| `ailang eval-chains view` | View chain results | `ailang eval-chains view <id>` |
| `ailang eval-chains stats` | Pass rate breakdown | `ailang eval-chains stats <id>` |
| `ailang eval-chains failures` | Show failures only | `ailang eval-chains failures <id>` |
| `ailang eval-report --from-chain` | Report from chain | `ailang eval-report --from-chain <id> v0.8.0` |
| `ailang eval-compare --chain` | Compare two chains | `ailang eval-compare --chain <id1> --chain <id2>` |
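Putting the table together, a typical agent-eval session might look like this (`<id>` is the chain id reported by `eval-chains list`):

```bash
ailang eval-suite --agent --models claude-haiku-4-5
ailang eval-chains list              # find the id of the new chain
ailang eval-chains failures <id>     # inspect only the failing benchmarks
ailang eval-report --from-chain <id> v0.8.0
```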
See architecture.md for detailed command reference.
## Benchmarks

The benchmark suite is organized into four tiers that serve different release-review purposes. Every benchmark YAML declares a tier and 1–3 tags, as sketched below. See benchmarks/CURATION.md for the full curation guide, promotion/demotion criteria, and rotation rules.
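A minimal sketch of what such a manifest might declare; only the `tier` and `tags` fields are documented here, the other fields are illustrative:

```yaml
# benchmarks/example.yml (hypothetical)
id: example_benchmark
tier: core                       # one of: smoke, core, stretch, vision
tags: [recursion, string_algo]   # 1-3 canonical tags (listed below)
```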
### Tier structure (v0.14.0+)
| Tier | Expected pass rate | Run cost | Role |
|---|---|---|---|
| `smoke` | ≥ 95% | seconds | Regression gate — flat near 100% |
| `core` | 70–95% | minutes | Headline metric — what releases quote |
| `stretch` | 30–70% | minutes | Headroom / differentiation benchmarks |
| `vision` | 0–50% | variable | Aspirational — measures potential |
The `core` tier pass rate is the primary headline metric on the dashboard and the number quoted in release notes. The dashboard at ailang.sunholo.com/benchmarks now renders a tier toggle; per-tier aggregates live under `tiers.<name>` in `docs/static/benchmarks/latest.json`.
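For example, one way to pull a single tier's aggregate out of that file (assumes `jq` is installed):

```bash
jq '.tiers.core' docs/static/benchmarks/latest.json
```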
### Running tier subsets

```bash
# Fast regression check (run before every commit)
make eval-smoke    # smoke tier only

# Release gate (run before every tag)
make eval-core     # core tier only — this produces the headline metric

# Differentiation check
make eval-stretch  # stretch tier only

# Arbitrary tier combinations via the --tier flag
ailang eval-suite --tier smoke,core --models claude-haiku-4-5
```
### Tags

Tags describe the AILANG feature surface a benchmark exercises. Use them to localize regressions:

```bash
ailang eval-matrix <baseline_dir> <version> --by-tags
```
Canonical tags: `adt_pattern_match`, `algorithmic`, `contracts`, `data_transform`, `effects_io`, `error_handling`, `functional`, `records`, `recursion`, `state_machine`, `string_algo`, `type_safety`. New tags require a CURATION.md update so the taxonomy stays stable.
### Rotation tools

- `ailang eval-matrix <dir> <ver> --show-saturated` — benchmarks at 100% across all models × all languages (retirement candidates).
- `ailang eval-matrix <dir> <ver> --ailang-wins` — `(benchmark, model)` cells where AILANG passes and Python fails (differentiation signal).
- `.claude/skills/eval-analyzer/scripts/benchmark_health.sh` — one-shot rotation report combining saturation, refusal detection, and tier promotion signals.
## Results Location

After running benchmarks:

### Standard Eval (file-based)

- JSON: `eval_results/baselines/VERSION/*.json` - Full details per run
- Matrix: `eval_results/baselines/VERSION/matrix.json` - Aggregated stats
- Dashboard: `docs/docs/benchmarks/performance.md` - Live leaderboard