Agent Harness Explorer

Agent mode only — these results come from multi-turn agentic coding sessions (Claude CLI, Gemini CLI, opencode, Codex). For 0-shot API and self-repair results, see Benchmarks.

Browse by language, harness, and model. The cross-harness comparison shows what happens when the same underlying model runs through a different CLI.

This view aggregates two sources: the cloud release/nightly baseline, and the continuous on-device rotation (local Ollama models on opencode and Pi, $0/run) which refreshes between releases. On-device models are tagged on-device and let you compare, for example, the same local Qwen across the opencode and Pi harnesses.

Loading benchmark data…

Local-rig version trend

Does each AILANG release move the needle for local models? This tracks the on-device rotation (opencode + Pi + motoko on local Qwen, $0/run) per AILANG release — columns are AILANG versions, newest on the right. Retired models freeze at their last version; active models re-run each release.

Loading version trend…

Local-rig version trend​

Local-rig version trend