Local Ollama Eval

Running AILANG agent-mode evaluations against local Ollama models — typically gemma4:26b — on a dedicated Mac Studio (128 GB unified memory, M4 Max). Companion to model-configuration.md which covers cloud OS models routed via OpenRouter.

This page reflects the M-EVAL-LOCAL-OLLAMA + M-EVAL-LOCAL-OBSERVABILITY milestones (v0.22.0).

TL;DR — canonical commands

After one-time setup (below), the rotation runs via:

# Smoke tier (17 benchmarks, ~110 min wall at p=2, requires ~50 GB memory headroom)
# EXTRA is double-quoted so the shell expands $(date ...) before make sees it;
# inside single quotes, make would normally read $(date ...) as an empty make variable.
make eval-smoke \
MODELS=opencode-gemma4-26b \
EXTRA="-agent -langs ailang \
-benchmarks fizzbuzz,adt_option,balanced_parens,binary_tree_sum,canonical_convergence,canonical_normalization,dense_operator_program,explicit_state_threading,gcd_lcm,immutable_data_structures,inline_tests,nested_records,numeric_modulo,record_update,records_book,recursion_fibonacci,type_safe_record_access \
-output eval_results/rotation/$(date +%Y-%m-%d)/$(date +%H%M)_gemma4-26b_smoke \
-parallel 2 \
-agent-timeout 2400"

# Watch progress live
ailang chains live $(ailang chains list --limit 1 --since 5m | tail -1 | awk '{print $1}')
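The one-liner above fails silently when no chain started in the last 5 minutes, because an empty id is passed straight to ailang chains live. A minimal sketch of a guard follows. The helper name latest_chain_id is hypothetical, and it assumes (as the one-liner does) that the chain id is the first whitespace-separated field of the last line of the list output.

```shell
# Hypothetical helper: read `ailang chains list` output on stdin and print
# the first field of the last non-empty line. Exits non-zero on empty input.
latest_chain_id() {
  awk 'NF { id = $1 } END { if (id == "") exit 1; print id }'
}

# Usage (needs a run in flight):
#   if id=$(ailang chains list --limit 1 --since 5m | latest_chain_id); then
#     ailang chains live "$id"
#   else
#     echo "no chain started in the last 5 minutes" >&2
#   fi
```

Unlike tail -1, this skips trailing blank lines, so a stray empty line at the end of the listing does not produce an empty id.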

One-time setup

1. Install prerequisites

# Go (for building ailang)
brew install go

# node + npm (for opencode and pi CLIs)
brew install node

# Ollama (the model runtime)
brew install ollama
ollama serve & # or start the app from Applications

# opencode CLI (agent-mode harness used by AILANG)
npm install -g opencode-ai
opencode --version # confirm 1.15.7 or newer

# Build and install ailang itself
cd "$REPO" # $REPO = path to your ailang checkout
make install

2. Pull the model

ollama pull gemma4:26b
ollama show gemma4:26b

Expected: 25.8B params, Q4_K_M, 17 GB on disk, 25.76 GB resident VRAM, 262k context.

3. Configure opencode's Ollama provider

~/.config/opencode/opencode.jsonc:

{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "Ollama Local",
"options": { "baseURL": "http://localhost:11434/v1" },
"models": {
"gemma4:26b": { "name": "Gemma 4 26B (local)" }
}
}
}
}

Verify with opencode models | grep ollama — should print ollama/gemma4:26b.

4. Configure Ollama concurrency

# Set in your shell init or via launchctl setenv if running Ollama.app
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1 # one model resident at a time
launchctl setenv OLLAMA_NUM_PARALLEL 4 # 4 concurrent requests/model
launchctl setenv OLLAMA_MAX_QUEUE 64 # back-pressure threshold

Restart Ollama for these to take effect.

5. Start the AILANG observability server

# Manual (for development)
make services-start
curl -s http://localhost:1957/health

# Persistent (for 24/7 rotation)
cp tools/launchd/dev.ailang.server.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/dev.ailang.server.plist
launchctl list | grep dev.ailang.server # confirm it's loaded

6. Enable OTLP export from eval-suite

Add to your shell init (.zshrc / .bashrc):

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:1957

This routes per-step OTEL spans from each opencode subprocess into the local observatory.db (~/.ailang/state/observatory.db). Required for ailang chains live to show per-stage progress. Without this set, eval runs still complete but you get no live monitoring.
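Because a missing variable only shows up later as an empty live view, it can help to check it before launching a run. The function below is a sketch, not part of ailang; the name check_otlp_endpoint is hypothetical, and the expected value is the endpoint given above.

```shell
# Sketch: report whether the OTLP endpoint is set to the local server.
# Prints "ok", "unset", or "mismatch: <value>"; returns non-zero unless ok.
check_otlp_endpoint() {
  local expected="http://localhost:1957"
  local actual="${OTEL_EXPORTER_OTLP_ENDPOINT:-}"
  if [ -z "$actual" ]; then
    echo "unset"
    return 1
  elif [ "$actual" != "$expected" ]; then
    echo "mismatch: $actual"
    return 1
  fi
  echo "ok"
}

# Usage: check_otlp_endpoint && make eval-smoke ...
```

This only checks the environment of the current shell. Whether the server is actually listening is a separate check (curl -s http://localhost:1957/health).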

How parallelism behaves on M4 Max

128 GB unified memory + 40 GPU cores; gemma4:26b uses 25.76 GB of unified memory. Empirical measurements (2026-05-22 small-N, then 2026-05-23 17-benchmark head-to-head):

| -parallel N | Per-bench wall | 17-bench wall | 17-bench pass | Note |
|---|---|---|---|---|
| 1 (serial) | 4–7 min (variance dominates) | not measured | n/a | Baseline; estimated ~3h |
| 2 | ~7 min | 1h 49m | 13/17 = 76.5% | Recommended default |
| 4 | ~15 min | 1h 39m | 10/17 = 58.8% | Slightly faster wall, but loses 2 benchmarks to TTFT timeouts under prefill contention |
| 8+ | not tested | | | Won't beat 4 unless OLLAMA_NUM_PARALLEL is raised |

-parallel 2 is the recommended default for the 24/7 rotation. Earlier small-N runs (1–2 benchmarks) suggested p=4 might help stability via token-rate throttling, but the 17-benchmark head-to-head on 2026-05-23 shows that p=4's TTFT contention costs 2 benchmarks (fizzbuzz, inline_tests) before they emit a single token, which outweighs any anti-thrash benefit. p=2 passes 3 more benchmarks (13 vs 10, +17.6 percentage points) for about 10 minutes more wall clock.
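The percentages quoted here follow directly from the pass counts. As a quick check, the arithmetic can be reproduced with awk; pass_rate is a hypothetical helper name used only for this illustration.

```shell
# Pass rate as a percentage with one decimal: pass_rate <passed> <total>
pass_rate() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * p / n }'
}

pass_rate 13 17   # p=2 -> 76.5
pass_rate 10 17   # p=4 -> 58.8

# Gap in percentage points: (13 - 10) / 17 -> 17.6
awk 'BEGIN { printf "%.1f\n", 100 * (13 - 10) / 17 }'
```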

Variance warning — single-trial pass rates swing 5–7 benchmarks across consecutive runs of the same model on the same seed. For trustworthy assessment use N≥3 trials. See M-EVAL-OS-LONGITUDINAL for the --trials N flag design.
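Until a --trials flag exists, repeated runs can be summarized by hand. The sketch below takes per-trial pass counts and prints their spread. The helper name trial_spread is hypothetical, and the counts in the usage line are placeholders, not measured results.

```shell
# Sketch: summarize pass counts from N trials of the same configuration.
# Usage: trial_spread <count> [<count> ...]
trial_spread() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | awk '
    NR == 1 { min = max = $1 + 0 }
    {
      v = $1 + 0
      sum += v
      if (v < min) min = v
      if (v > max) max = v
    }
    END { printf "min=%d max=%d mean=%.1f\n", min, max, sum / NR }'
}

# Placeholder counts for three trials of a 17-benchmark tier:
trial_spread 9 13 11   # min=9 max=13 mean=11.0
```

A wide min-to-max gap relative to the difference between two configurations is a sign that a single-trial comparison is not yet trustworthy.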

Per-model config in models.yml

The relevant entry (already in internal/eval_harness/models.yml):

opencode-gemma4-26b:
api_name: "gemma4:26b"
provider: "ollama"
agent_cli: "opencode"
agent_model_name: "ollama/gemma4:26b"
max_output_tokens: 8192
ttft_timeout: 900 # 15 min — local thinking + p=4 contention
generation_timeout: 1200 # 20 min — opencode per-session hard cap
budgets:
hard_timeout_secs: 2400 # 40 min — wall-clock safety net
pricing:
input_per_1k: 0.0
output_per_1k: 0.0

Two design choices worth knowing:

  1. pricing: 0 means cost gate is unused — wall-clock is the only cap.
  2. budgets.hard_timeout_secs wins over benchmark-spec timeout: fields (M-EVAL-LOCAL-OLLAMA precedence fix). Local thinking models can iterate long even on benchmarks that have a cloud-tuned timeout: 90s.
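The precedence rule in point 2 can be illustrated with a small sketch. This is not the harness's actual code; effective_timeout is a hypothetical name, and both values are assumed to be plain seconds.

```shell
# Sketch of the precedence rule: a model-level hard timeout, when present,
# overrides the benchmark spec's own timeout.
# Usage: effective_timeout <hard_timeout_secs or ""> <benchmark_timeout_secs>
effective_timeout() {
  local hard="$1" bench="$2"
  if [ -n "$hard" ]; then
    echo "$hard"
  else
    echo "$bench"
  fi
}

effective_timeout 2400 90   # model sets a budget -> 2400
effective_timeout "" 90     # no model budget     -> 90
```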

Live monitoring

While a run is in flight:

# Find the active chain (most recent)
ailang chains list --since 5m --limit 1

# Live view with 3-second refresh (default)
ailang chains live <chain-id>

# Faster refresh
ailang chains live <chain-id> --interval 1

# Single render then exit (useful in scripts)
ailang chains live <chain-id> --once

Output:

Chain: c68f0cc6 Source: eval_suite Status: active Elapsed: 12m
Ollama: gemma4:26b (VRAM 25.76 GB)
────────────────────────────────────────────────────────────────────────────────
# Benchmark / Agent Status Turns Tokens Last span
────────────────────────────────────────────────────────────────────────────────
1 eval-agent:fizzbuzz running 12 47K 3s ago
2 eval-agent:adt_option running 8 31K 12s ago
3 eval-agent:balanced_parens running 0 0 ⚠ 540s ago (stuck?)
4 eval-agent:recursion_fib running 14 52K 1s ago
────────────────────────────────────────────────────────────────────────────────

The ⚠ stuck? indicator fires when the most recent span for a stage is >300s old AND its status is still running. This does not always mean a hang: local thinking models can spend minutes in pure reasoning before emitting visible output. Before intervening, check that the Ollama: header still shows the model with non-zero VRAM, then check the ollama runner CPU% (ps aux | grep "ollama runner"). If runner CPU is >20%, the model is generating; if <1%, it really is stuck.
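The decision rule above can be written down as a small function. This is a sketch using the thresholds from the text (300 s, 20%, 1%); the name classify_stage and the "inconclusive" label for CPU between 1% and 20% are assumptions, since the text does not say how to read that middle range.

```shell
# Sketch: classify a stage from its status, last-span age (seconds),
# and the ollama runner's CPU percentage.
# Usage: classify_stage <status> <span_age_secs> <runner_cpu_pct>
classify_stage() {
  local status="$1" span_age="$2" cpu="$3"
  # The stuck? indicator only applies to running stages with a stale span.
  if [ "$status" != "running" ] || [ "$span_age" -le 300 ]; then
    echo "ok"
    return
  fi
  awk -v c="$cpu" 'BEGIN {
    c = c + 0
    if (c > 20)     print "generating"
    else if (c < 1) print "stuck"
    else            print "inconclusive"
  }'
}

classify_stage running 540 35.2   # generating
classify_stage running 540 0.3    # stuck
classify_stage running 12 0       # ok
```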

Troubleshooting

Symptom: observatory.db stays at 0 spans

  • Verify OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:1957 is set in the shell where you launched eval-suite (not just in your config). To confirm, run ps eww -p <pid> against the eval-suite process and check that the variable appears in its environment.
  • Verify server is up: curl -s http://localhost:1957/health.
  • Check ~/.ailang/logs/server.log for FOREIGN KEY constraint failed (should be absent post-M-EVAL-LOCAL-OBSERVABILITY M1).

Symptom: benchmarks fail with "opencode produced no output within 8m0s (prefill timeout)"

  • The ttft_timeout is too tight for your -parallel N level. At p=4 we observed prefill needing ~12 min for some benchmarks. Bump opencode-gemma4-26b.ttft_timeout in models.yml.
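One way to pick a new value is to take the longest prefill you have observed and add headroom. The sketch below does that arithmetic; suggest_ttft_timeout is a hypothetical helper, and the 25% margin is an arbitrary illustration rather than a recommendation from the milestone notes.

```shell
# Sketch: convert an observed prefill time (minutes) plus a safety margin
# (percent) into a ttft_timeout value in seconds.
# Usage: suggest_ttft_timeout <observed_minutes> <margin_percent>
suggest_ttft_timeout() {
  local observed_min="$1" margin_pct="$2"
  echo $(( observed_min * 60 * (100 + margin_pct) / 100 ))
}

suggest_ttft_timeout 12 25   # 12 min observed, 25% margin -> 900
```

Keep the result below generation_timeout and budgets.hard_timeout_secs, otherwise those caps will fire first.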

Symptom: high token thrashing (>1M tokens for a simple benchmark)

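  • Try raising -parallel from 2 to 4. The reduced per-request token rate acts as a "think before emitting" governor that suppresses runaway loops. The trade-off is more TTFT timeouts under prefill contention, which is why p=2 remains the default (see "How parallelism behaves on M4 Max").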
  • This is expected behavior for some benchmarks (e.g. dense_operator_program consistently thrashes regardless of config — a real model gap).

Symptom: "non-agentic result: 1 turns, 0 tool calls"

  • The model decided to one-shot the answer instead of using tools. opencode rejects this as not-agentic. Currently no clean workaround beyond re-running. See M-EVAL-LOCAL-OBSERVABILITY notes for "make non-agentic a warning not error" deferred work.

Symptom: ailang chains live shows "(no spans yet)" for every stage even mid-run

  • Spans are landing but not joined to stages. This is a known follow-up: per-stage chain_id/stage_id resource attrs need to be added at the eval-suite OTLP-resource layer. Use ailang chains diagnose or sqlite3 ~/.ailang/state/observatory.db 'SELECT COUNT(*) FROM spans' to confirm spans are still flowing.