Local Ollama Eval
This page covers running AILANG agent-mode evaluations against local Ollama models (typically gemma4:26b) on a dedicated Mac Studio (128 GB unified memory, M4 Max). It is a companion to model-configuration.md, which covers cloud OS models routed via OpenRouter.
This page reflects the M-EVAL-LOCAL-OLLAMA + M-EVAL-LOCAL-OBSERVABILITY milestones (v0.22.0).
TL;DR — canonical commands
After one-time setup (below), the rotation runs via:
# Smoke tier (17 benchmarks, ~110 min wall at p=2, requires ~50 GB memory headroom)
make eval-smoke \
MODELS=opencode-gemma4-26b \
EXTRA='-agent -langs ailang \
-benchmarks fizzbuzz,adt_option,balanced_parens,binary_tree_sum,canonical_convergence,canonical_normalization,dense_operator_program,explicit_state_threading,gcd_lcm,immutable_data_structures,inline_tests,nested_records,numeric_modulo,record_update,records_book,recursion_fibonacci,type_safe_record_access \
-output eval_results/rotation/$(date +%Y-%m-%d)/$(date +%H%M)_gemma4-26b_smoke \
-parallel 2 \
-agent-timeout 2400'
# Watch progress live
ailang chains live $(ailang chains list --limit 1 --since 5m | tail -1 | awk '{print $1}')
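The one-liner above assumes the chain ID is the first field of the last line printed by ailang chains list. A slightly more defensive sketch, under that same assumption, avoids calling ailang chains live with an empty argument when no chain started in the last 5 minutes. latest_chain_id is a hypothetical helper, not part of ailang.

```shell
# Hypothetical helper: read `ailang chains list` output on stdin and print
# the first field of the last line (prints nothing if that line is empty).
latest_chain_id() {
  tail -n 1 | awk 'NF { print $1 }'
}

# Usage (requires ailang on PATH and a recent run):
#   id=$(ailang chains list --limit 1 --since 5m | latest_chain_id)
#   if [ -n "$id" ]; then ailang chains live "$id"; else echo "no recent chain" >&2; fi
```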
One-time setup
1. Install prerequisites
# Go (for building ailang)
brew install go
# node + npm (for opencode and pi CLIs)
brew install node
# Ollama (the model runtime)
brew install ollama
ollama serve & # or start the app from Applications
# opencode CLI (agent-mode harness used by AILANG)
npm install -g opencode-ai
opencode --version # confirm 1.15.7 or newer
# Build and install ailang itself
cd $REPO
make install
2. Pull the model
ollama pull gemma4:26b
ollama show gemma4:26b
Expected: 25.8B params, Q4_K_M, 17 GB on disk, 25.76 GB resident VRAM, 262k context.
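To check the quantization without reading the output by eye, something like the sketch below may help. It assumes ollama show prints a line of the form "quantization    Q4_K_M"; that layout is an assumption about the CLI output and may differ between Ollama versions. has_quantization is a hypothetical helper.

```shell
# Hypothetical helper: succeed if stdin (expected: `ollama show` output)
# contains a "quantization <value>" line matching the given value.
has_quantization() {
  grep -Eq "quantization[[:space:]]+$1([[:space:]]|\$)"
}

# Usage (requires a running Ollama with the model pulled):
#   ollama show gemma4:26b | has_quantization Q4_K_M && echo "quantization ok"
```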
3. Configure opencode's Ollama provider
~/.config/opencode/opencode.jsonc:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "Ollama Local",
"options": { "baseURL": "http://localhost:11434/v1" },
"models": {
"gemma4:26b": { "name": "Gemma 4 26B (local)" }
}
}
}
}
Verify with opencode models | grep ollama — should print ollama/gemma4:26b.
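If opencode does not list the model, it can help to confirm that the baseURL from the config is serving it. The sketch below assumes Ollama's OpenAI-compatible /v1/models endpoint returns a JSON list whose entries carry an "id" field. has_model is a hypothetical helper that does a plain text match rather than real JSON parsing.

```shell
# Hypothetical helper: succeed if the JSON on stdin contains "id": "<model>".
# The model name is used inside a regex, so names containing regex
# metacharacters would need escaping.
has_model() {
  grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage (requires Ollama listening on the default port):
#   curl -s http://localhost:11434/v1/models | has_model gemma4:26b \
#     && echo "endpoint serves gemma4:26b"
```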
4. Configure Ollama parallelism (recommended)
# Set in your shell init or via launchctl setenv if running Ollama.app
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1 # one model resident at a time
launchctl setenv OLLAMA_NUM_PARALLEL 4 # 4 concurrent requests/model
launchctl setenv OLLAMA_MAX_QUEUE 64 # back-pressure threshold
Restart Ollama for these to take effect.
5. Start the AILANG observability server
# Manual (for development)
make services-start
curl -s http://localhost:1957/health
# Persistent (for 24/7 rotation)
cp tools/launchd/dev.ailang.server.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/dev.ailang.server.plist
launchctl list | grep dev.ailang.server # confirm it's loaded
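The server may take a moment to answer after it starts, so a script that launches a run straight afterwards might want to poll the health endpoint first. A minimal sketch, where wait_for is a hypothetical generic retry helper and the 10-attempt budget is an arbitrary choice:

```shell
# Hypothetical helper: run a command up to N times, one second apart.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_for() {
  tries="$1"
  shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for 10 curl -sf http://localhost:1957/health || echo "server not up" >&2
```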
6. Enable OTLP export from eval-suite
Add to your shell init (.zshrc / .bashrc):
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:1957
This routes per-step OTEL spans from each opencode subprocess into the local
observatory.db (~/.ailang/state/observatory.db). Required for
ailang chains live to show per-stage progress. Without this set, eval
runs still complete but you get no live monitoring.
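Because a missing variable fails silently (the run completes, only monitoring is lost), a preflight check before launching a long run can save a wasted rotation. A sketch, where check_otlp_endpoint is a hypothetical helper and the default matches the endpoint exported above:

```shell
# Hypothetical preflight: print "ok" if OTEL_EXPORTER_OTLP_ENDPOINT matches
# the expected endpoint, otherwise report the mismatch and return 1.
check_otlp_endpoint() {
  expected="${1:-http://localhost:1957}"
  actual="${OTEL_EXPORTER_OTLP_ENDPOINT:-}"
  if [ "$actual" = "$expected" ]; then
    echo "ok"
  else
    echo "OTEL_EXPORTER_OTLP_ENDPOINT is '${actual}', expected '${expected}'" >&2
    return 1
  fi
}

# Usage:
#   check_otlp_endpoint && make eval-smoke MODELS=opencode-gemma4-26b ...
```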
How parallelism behaves on M4 Max
128 GB unified memory + 40 GPU cores; gemma4:26b uses 25.76 GB of unified memory. Empirical measurements (2026-05-22 small-N, then 2026-05-23 17-benchmark head-to-head):
| -parallel N | Per-bench wall | 17-bench wall | 17-bench pass | Note |
|---|---|---|---|---|
| 1 (serial) | 4–7 min (variance dominates) | not measured | n/a | Baseline; estimated ~3h |
| 2 | ~7 min | 1h 49m | 13/17 = 76.5% ✅ | Recommended default |
| 4 | ~15 min | 1h 39m | 10/17 = 58.8% | Slightly faster wall, but loses 2 benchmarks to TTFT timeouts under prefill contention |
| 8+ | not tested | — | — | Won't beat 4 unless OLLAMA_NUM_PARALLEL is raised |
-parallel 2 is the recommended default for the 24/7 rotation. Earlier
small-N runs (1–2 benchmarks) suggested p=4 might help stability via
token-rate throttling, but the 17-benchmark head-to-head on 2026-05-23 shows
that p=4's TTFT contention loses 2 benchmarks (fizzbuzz, inline_tests) before
they emit a single token, which outweighs any anti-thrash benefit. p=2 gains
17.6 percentage points on pass rate (13/17 vs 10/17) for about 10 extra
minutes of wall clock.
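The pass-rate figures can be reproduced from the raw counts in the table. The helpers below are illustrative only (pass_rate and pass_gap are hypothetical names); the gap is computed from the counts rather than from the rounded percentages.

```shell
# Pass rate as a percentage with one decimal place: 100 * passed / total.
pass_rate() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * p / t }'
}

# Percentage-point gap between two pass counts over the same total.
pass_gap() {
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.1f\n", 100 * (a - b) / t }'
}

pass_rate 13 17    # p=2 -> 76.5
pass_rate 10 17    # p=4 -> 58.8
pass_gap 13 10 17  # -> 17.6
```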
Variance warning: single-trial pass rates swing by 5–7 benchmarks across consecutive runs of the same model on the same seed. For a trustworthy assessment, use N≥3 trials. See M-EVAL-OS-LONGITUDINAL for the --trials N flag design.
Per-model config in models.yml
The relevant entry (already in internal/eval_harness/models.yml):
opencode-gemma4-26b:
api_name: "gemma4:26b"
provider: "ollama"
agent_cli: "opencode"
agent_model_name: "ollama/gemma4:26b"
max_output_tokens: 8192
ttft_timeout: 900 # 15 min — local thinking + p=4 contention
generation_timeout: 1200 # 20 min — opencode per-session hard cap
budgets:
hard_timeout_secs: 2400 # 40 min — wall-clock safety net
pricing:
input_per_1k: 0.0
output_per_1k: 0.0
Two design choices worth knowing:
- pricing: 0 means the cost gate is unused; wall-clock time is the only cap.
- budgets.hard_timeout_secs wins over benchmark-spec timeout: fields (M-EVAL-LOCAL-OLLAMA precedence fix). Local thinking models can iterate for a long time even on benchmarks that have a cloud-tuned timeout: 90s.
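The precedence rule can be pictured as follows. This is a sketch of the behavior described here, not the harness's actual code: effective_timeout is a hypothetical function, and falling back to the benchmark's own timeout when no hard timeout is configured is an assumption.

```shell
# Hypothetical model of the precedence rule: a positive model-level
# hard_timeout_secs wins; otherwise the benchmark's own timeout applies
# (the fallback branch is an assumption).
effective_timeout() {
  hard="${1:-0}"
  bench="$2"
  if [ "$hard" -gt 0 ]; then
    echo "$hard"
  else
    echo "$bench"
  fi
}

effective_timeout 2400 90   # model cap 2400 s beats the benchmark's 90 s
effective_timeout 0 90      # no model cap set: benchmark timeout applies
```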
Live monitoring
While a run is in flight:
# Find the active chain (most recent)
ailang chains list --since 5m --limit 1
# Live view with 3-second refresh (default)
ailang chains live <chain-id>
# Faster refresh
ailang chains live <chain-id> --interval 1
# Single render then exit (useful in scripts)
ailang chains live <chain-id> --once
Output:
Chain: c68f0cc6 Source: eval_suite Status: active Elapsed: 12m
Ollama: gemma4:26b (VRAM 25.76 GB)
────────────────────────────────────────────────────────────────────────────────
# Benchmark / Agent Status Turns Tokens Last span
────────────────────────────────────────────────────────────────────────────────
1 eval-agent:fizzbuzz running 12 47K 3s ago
2 eval-agent:adt_option running 8 31K 12s ago
3 eval-agent:balanced_parens running 0 0 ⚠ 540s ago (stuck?)
4 eval-agent:recursion_fib running 14 52K 1s ago
────────────────────────────────────────────────────────────────────────────────
The ⚠ stuck? indicator fires when the most recent span for a stage is more
than 300s old and its status is still running. Local thinking models can
spend minutes in pure reasoning before emitting visible output, so first
check whether the Ollama: header still shows the model with non-zero VRAM,
then check the ollama runner CPU% (ps aux | grep "ollama runner"). If runner
CPU is >20%, the model is generating; if <1%, it really is stuck.
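The two rules in this section can be written down as a pair of small predicates. Both function names are hypothetical, and the "unclear" band between 1% and 20% CPU is an assumption, since only the two extremes are stated here.

```shell
# Stuck indicator: status is "running" AND the last span is older than 300 s.
is_stuck() {
  [ "$1" = "running" ] && [ "$2" -gt 300 ]
}

# Runner CPU heuristic: above 20% means generating, below 1% means stuck.
# Anything in between is reported as "unclear" (assumption).
classify_runner_cpu() {
  awk -v c="$1" 'BEGIN {
    if (c > 20) print "generating"
    else if (c < 1) print "stuck"
    else print "unclear"
  }'
}

if is_stuck running 540; then echo "flag as stuck?"; fi
classify_runner_cpu 35.2
```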
Troubleshooting
Symptom: observatory.db stays at 0 spans
- Verify OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:1957 is set in the shell where you launched eval-suite (not just in your config). Confirm that ps eww -p <pid> lists the variable.
- Verify the server is up: curl -s http://localhost:1957/health.
- Check ~/.ailang/logs/server.log for FOREIGN KEY constraint failed (should be absent post-M-EVAL-LOCAL-OBSERVABILITY M1).
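The log check in the last item can be scripted. A small sketch, where count_fk_errors is a hypothetical helper that counts matching lines in whatever file it is given:

```shell
# Hypothetical helper: print the number of lines in the given log file that
# contain the FOREIGN KEY error (0 if none; nothing if the file is unreadable).
count_fk_errors() {
  grep -c 'FOREIGN KEY constraint failed' "$1" 2>/dev/null || true
}

# Usage:
#   count_fk_errors ~/.ailang/logs/server.log
```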
Symptom: benchmarks fail with "opencode produced no output within 8m0s (prefill timeout)"
- The ttft_timeout is too tight for your -parallel N level. At p=4 we observed prefill needing ~12 min for some benchmarks. Bump opencode-gemma4-26b.ttft_timeout in models.yml.
Symptom: high token thrashing (>1M tokens for a simple benchmark)
- Try raising -parallel from 2 to 4. The reduced per-request token rate acts as a "think before emitting" governor that suppresses runaway loops, at the cost of the TTFT contention measured at p=4.
- This is expected behavior for some benchmarks (e.g. dense_operator_program consistently thrashes regardless of config, which is a real model gap).
Symptom: "non-agentic result: 1 turns, 0 tool calls"
- The model decided to one-shot the answer instead of using tools. opencode rejects this as not-agentic. Currently no clean workaround beyond re-running. See M-EVAL-LOCAL-OBSERVABILITY notes for "make non-agentic a warning not error" deferred work.
Symptom: ailang chains live shows "(no spans yet)" for every stage even mid-run
- Spans are landing but not joined to stages. This is a known follow-up: per-stage chain_id/stage_id resource attrs need to be added at the eval-suite OTLP-resource layer. Use ailang chains diagnose or sqlite3 ~/.ailang/state/observatory.db 'SELECT COUNT(*) FROM spans' to confirm spans are still flowing.
Related
- M-EVAL-LOCAL-OLLAMA design doc
- M-EVAL-LOCAL-OBSERVABILITY design doc
- model-configuration.md — cloud OS models via OpenRouter
- Ollama library — full model catalog