
Agent Harness Setup

AILANG's agent eval mode runs benchmarks through agentic CLI tools — the same tools developers use interactively. This guide covers how to install and authenticate each supported harness for ailang eval-suite --agent.

Supported Harnesses

| Harness | CLI tool | Models in models.yml | Install |
|---|---|---|---|
| claude | Claude Code (claude) | claude-sonnet-4-6, claude-haiku-4-5 | npm install -g @anthropic-ai/claude-code |
| managed_agents | (HTTP API, no CLI) | gemini-3-5-flash | gcloud auth application-default login |
| codex | OpenAI Codex CLI (codex) | gpt5-4, gpt5-1-instant | npm install -g @openai/codex |
| opencode | opencode (opencode) | opencode-haiku, opencode-sonnet-4-6, opencode-gemini-3-flash | npm install -g opencode-ai |

Note (v0.22.0, M-MANAGED-AGENTS): The legacy gemini CLI executor has been retired. Google deprecates the Gemini CLI on 2026-06-18, and Gemini CLI v0.42 has a stale model allowlist (no gemini-3-5-flash). It is replaced by the managed_agents executor, which calls the Vertex AI Managed Agents API directly via ADC. Older Gemini models (2.5, 3, 3.1) lose agent-mode coverage but keep standard-mode coverage via direct Vertex generateContent calls.

Quick Check

claude --version
gcloud auth application-default print-access-token # for managed_agents
codex --version
opencode --version
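
The same check can be scripted. The sketch below assumes only the binary names used in this guide (claude, codex, opencode, and gcloud for managed_agents); the helper name missing_harness_binaries is hypothetical and not part of AILANG. It checks PATH only and does not verify authentication.

```python
import shutil

# Binary names taken from the commands in this guide. managed_agents has no
# CLI of its own, so it is represented by gcloud (used for ADC login).
HARNESS_BINARIES = {
    "claude": "claude",
    "codex": "codex",
    "opencode": "opencode",
    "managed_agents": "gcloud",
}


def missing_harness_binaries(which=shutil.which):
    """Return the harness names whose binary is not found on PATH."""
    return sorted(
        name for name, binary in HARNESS_BINARIES.items() if which(binary) is None
    )


if __name__ == "__main__":
    missing = missing_harness_binaries()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all harness binaries found")
```

The which parameter is injectable so the function can be exercised without touching the real PATH.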

Claude Code (claude)

npm install -g @anthropic-ai/claude-code
export ANTHROPIC_API_KEY=sk-ant-...
claude --version

Verify agentic mode works:

echo "Write hello world to solution.py" | claude --print --verbose \
  --output-format stream-json --permission-mode bypassPermissions

The --permission-mode bypassPermissions flag is what the executor uses to auto-approve file edits; --verbose is needed because claude rejects --output-format stream-json with --print unless it is set. If you see JSON events with "type":"tool_use", the harness is working.
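
To check the stream without reading it by eye, a small filter can count the events that carry a tool_use object. This is a sketch under one assumption: the tool_use object may sit at the top level of an event or be nested inside it, so the search is recursive. The function names are illustrative, not part of the AILANG executor.

```python
import json


def contains_tool_use(value):
    """Recursively look for a dict whose "type" field is "tool_use"."""
    if isinstance(value, dict):
        if value.get("type") == "tool_use":
            return True
        return any(contains_tool_use(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_tool_use(v) for v in value)
    return False


def count_tool_use_events(lines):
    """Count NDJSON lines that contain at least one tool_use object."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as warnings
        if contains_tool_use(event):
            count += 1
    return count
```

Passing sys.stdin as lines lets the function sit at the end of the pipe above; a result of 0 corresponds to the "0 tool calls" case described under Troubleshooting.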

Managed Agents API (managed_agents)

The Managed Agents executor calls the Vertex AI Managed Agents endpoint (aiplatform.googleapis.com/v1beta1/.../interactions) directly over HTTP using Application Default Credentials. There is no local CLI: the agent runs in a Google-hosted Linux sandbox per interaction, with tool execution and multi-turn state managed server-side.

Setup

# 1. Authenticate via ADC (same flow as direct Vertex generateContent calls)
gcloud auth application-default login

# 2. Set the default project (or set Task.GCPProject per-call via models.yml)
gcloud config set project ailang-dev

# 3. First call to a fresh project provisions the service (HTTP 400
# "Provisioning is in progress" for ~3 min, then ready). The executor's
# error message includes this hint.
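
A caller that wants to wait out provisioning rather than re-run by hand could wrap the request in a retry loop. The sketch below is an illustration, not what the executor does: send is a hypothetical callable returning (status_code, body_text), and the 15-second interval and 5-minute cap are assumed values chosen to cover the ~3 minute window from step 3.

```python
import time


def call_with_provisioning_retry(send, max_wait_s=300.0, interval_s=15.0,
                                 sleep=time.sleep):
    """Call send() until it stops reporting provisioning or max_wait_s elapses.

    send is a hypothetical callable returning (status_code, body_text).
    Only HTTP 400 responses whose body mentions provisioning are retried;
    every other response is returned immediately.
    """
    waited = 0.0
    while True:
        status, body = send()
        provisioning = status == 400 and "Provisioning is in progress" in body
        if not provisioning or waited >= max_wait_s:
            return status, body
        sleep(interval_s)
        waited += interval_s
```

Matching on the message text keeps other HTTP 400 errors (for example a malformed request body) from being retried.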

Verify

ACCESS_TOKEN=$(gcloud auth application-default print-access-token)
curl -sN -X POST \
  "https://aiplatform.googleapis.com/v1beta1/projects/ailang-dev/locations/global/interactions" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Api-Revision: 2026-05-20" \
  -d '{
    "stream": true, "background": true, "store": true,
    "agent": "antigravity-preview-05-2026",
    "environment": {"type": "remote"},
    "input": [{"type":"user_input","content":[{"type":"text","text":"reply PONG"}]}]
  }'

Successful output is an SSE stream ending with event: interaction.completed followed by data: [DONE].
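
The success condition above can be checked mechanically. This sketch assumes only the two markers named in this guide (an event: interaction.completed line and a final data: [DONE] line); it does not parse the event payloads, and the function name is illustrative.

```python
def sse_completed(lines):
    """True if the stream saw interaction.completed and ends with data: [DONE]."""
    # SSE separates events with blank lines, so drop them before inspecting.
    meaningful = [line.strip() for line in lines if line.strip()]
    if not meaningful or meaningful[-1] != "data: [DONE]":
        return False
    return "event: interaction.completed" in meaningful
```

Feeding it the lines of the curl output distinguishes a completed interaction from a stream that was cut off or that ended in an error event.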

Cross-environment file bridge

Because the agent runs in a remote sandbox, file edits the agent makes do NOT touch the local workspace. The eval harness handles this by:

  1. Appending an instruction to the system prompt that tells the agent to dump its complete solution as a fenced code block at the end of its response
  2. Extracting that fenced block from Result.Output and writing it to <workspace>/benchmark/solution.ail after the run

This is automatic — handled by eval_harness/managed_agents_bridge.go for any executor that advertises executor.CapRemoteSandbox. Other backend callers that don't need file bridging (e.g. plain reasoning queries) get a policy-free executor.
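
The two bridge steps can be sketched as follows. The real implementation is the Go code in eval_harness/managed_agents_bridge.go; this Python version is only an illustration of the idea, and the function names and the choice to take the last fenced block are assumptions.

```python
from pathlib import Path
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this example readable

# Opening fence with an optional language tag, lazily captured body, closing fence.
FENCE_RE = re.compile(FENCE + r"[^\n`]*\n(.*?)\n?" + FENCE, re.DOTALL)


def extract_last_fenced_block(output):
    """Return the body of the last fenced code block in output, or None."""
    blocks = FENCE_RE.findall(output)
    return blocks[-1] if blocks else None


def bridge_solution(output, workspace):
    """Write the last fenced block to <workspace>/benchmark/solution.ail.

    Returns the written path, or None if the output has no fenced block.
    """
    code = extract_last_fenced_block(output)
    if code is None:
        return None
    target = Path(workspace) / "benchmark" / "solution.ail"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(code + "\n", encoding="utf-8")
    return target
```

Taking the last block matches the instruction to place the complete solution at the end of the response, so earlier snippets in the agent's reasoning are ignored.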

Limits

  • No multi-turn yet. Each Execute() provisions a fresh sandbox.
  • Region locked to global. No regional Vertex endpoints for this API yet.
  • Api-Revision: 2026-05-20 header pinned in the executor — guards against schema drift. Bump when Google publishes a new revision.
  • Cost: $1.50 input / $9.00 output per 1M tokens (Vertex gemini-3.5-flash pricing).
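
As a rough worked example of the cost line, the sketch below assumes the two figures are the input-token and output-token rates respectively, per one million tokens. The token counts in the usage note are made up for illustration.

```python
INPUT_USD_PER_M = 1.50   # assumed: rate per 1M input tokens
OUTPUT_USD_PER_M = 9.00  # assumed: rate per 1M output tokens


def run_cost_usd(input_tokens, output_tokens):
    """Estimated cost in USD for one run at the rates above."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000
```

For instance, a run that consumes 200,000 input tokens and 20,000 output tokens comes to (200,000 × 1.50 + 20,000 × 9.00) / 1,000,000 = $0.48.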

OpenAI Codex CLI (codex)

npm install -g @openai/codex
export OPENAI_API_KEY=sk-...
codex --version

The executor uses codex exec --json --model <model> --dangerously-bypass-approvals-and-sandbox.

Verify:

echo "Write hello world to solution.py" | codex exec --json \
  --model gpt-5.4 --dangerously-bypass-approvals-and-sandbox

You should see NDJSON events including thread.started, turn.started, item.completed with type: "file_change", and turn.completed with usage stats.

Note: Codex CLI v0.1+ uses the thread/item event format. Older versions used a flat message/tool_use format. The AILANG executor handles both.
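
A quick way to confirm the thread/item events listed above is to summarize the NDJSON stream. This sketch makes two layout assumptions that should be checked against real output: the completed item is nested under an "item" key, and usage stats sit under a "usage" key on turn.completed. It does not handle the older flat format, and it is not the AILANG executor's parser.

```python
import json


def summarize_codex_events(lines):
    """Summarize thread/item-format NDJSON lines from codex exec --json."""
    summary = {"file_changes": 0, "turns_completed": 0, "usage": None}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON output
        if not isinstance(event, dict):
            continue
        kind = event.get("type")
        if kind == "item.completed":
            item = event.get("item") or {}  # assumed nesting
            if isinstance(item, dict) and item.get("type") == "file_change":
                summary["file_changes"] += 1
        elif kind == "turn.completed":
            summary["turns_completed"] += 1
            summary["usage"] = event.get("usage")  # assumed field name
    return summary
```

A summary with file_changes at 0 after a completed turn suggests the model answered in text instead of writing solution.py.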

opencode (opencode)

opencode is a multi-provider gateway that supports Anthropic, OpenAI, Google Vertex, and local Ollama models through a single CLI.

npm install -g opencode-ai
opencode --version # e.g. 1.14.20

Provider Authentication

Each provider opencode talks to needs credentials:

| Provider | Setup |
|---|---|
| Anthropic | export ANTHROPIC_API_KEY=sk-ant-... |
| OpenAI | export OPENAI_API_KEY=sk-... |
| Google Vertex | gcloud auth application-default login |
| Ollama (local) | ollama serve running; no key needed |

Model String Format

opencode uses provider/model strings — not bare model names:

anthropic/claude-haiku-4-5 # Anthropic
openai/gpt-5.4 # OpenAI
google-vertex/gemini-3-flash-preview # Google Vertex AI
ollama/gemma4:latest # Local Ollama

Important: Google models require the google-vertex/ prefix. google/ is not a registered provider and causes ProviderModelNotFoundError. Run opencode models google-vertex to list available model IDs.

To discover all available providers and models:

opencode models # all providers
opencode models anthropic # Anthropic models only
opencode models google-vertex # Google Vertex models
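
A model string can be sanity-checked before it reaches opencode. The sketch below knows only the four provider prefixes that appear in this guide, so treat its provider set as an assumption; opencode models remains the authoritative list. The function name is hypothetical.

```python
# Provider prefixes named in this guide; not an exhaustive list.
KNOWN_PROVIDERS = {"anthropic", "openai", "google-vertex", "ollama"}


def check_model_string(model):
    """Return None if model looks like provider/model, else an error message."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        return f"{model!r} is not in provider/model form"
    if provider == "google":
        return "use the google-vertex/ prefix instead of google/"
    if provider not in KNOWN_PROVIDERS:
        return f"unknown provider {provider!r}"
    return None
```

Running this over every agent_model_name in models.yml would catch the google/ mistake before it surfaces as ProviderModelNotFoundError.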

Verify opencode Works

mkdir -p /tmp/oc_test && cd /tmp/oc_test
echo "Write hello world to solution.py" | opencode run \
  --format json --dangerously-skip-permissions \
  --model anthropic/claude-haiku-4-5

You should see NDJSON events with "type":"tool_use" for file writes.

Local Models via Ollama

opencode can route to local Ollama models with a custom provider config at ~/.config/opencode/opencode.jsonc:

{
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama Local",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": {
        "gemma4:latest": { "name": "Gemma 4" },
        "gemma3:4b": { "name": "Gemma 3 4B" }
      }
    }
  }
}

Then use ollama/gemma4:latest as the model string, or add an entry to models.yml (for example, opencode-gemma4) with agent_cli: "opencode" and agent_model_name: "ollama/gemma4:latest".

See internal/executor/opencode/testdata/opencode_ollama_config.jsonc for a complete config example.

Running a Cross-Harness Smoke Eval

Once all harnesses are installed and authenticated, run the cross-harness comparison:

# Dry run to confirm 5 models × 3 benchmarks × 2 languages = 30 runs
ailang eval-suite --agent --models harness_suite \
  --benchmarks fizzbuzz,gcd_lcm,balanced_parens \
  --langs ailang,python --dry-run

# Full run (5 parallel agent sessions)
ailang eval-suite --agent --models harness_suite \
  --benchmarks fizzbuzz,gcd_lcm,balanced_parens \
  --langs ailang,python --agent-parallel 5

harness_suite expands to:

  • claude-sonnet-4-6 → claude harness
  • opencode-sonnet-4-6 → opencode harness (Anthropic backend)
  • gemini-3-flash → gemini harness
  • opencode-gemini-3-flash → opencode harness (Google Vertex backend)
  • gpt5-4 → codex harness

This gives a delta (Δ) comparison between same-model, different-harness pairs (Sonnet via claude vs opencode; Flash via gemini vs opencode). Results appear in /docs/benchmarks/by-harness once ailang eval-report --format=json is re-run.
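
The run count reported by the dry run is the size of the cross product of models, benchmarks, and languages. The sketch below enumerates it using the model names from the harness_suite expansion above; it illustrates the arithmetic only and says nothing about how ailang schedules the runs.

```python
from itertools import product

MODELS = [
    "claude-sonnet-4-6",
    "opencode-sonnet-4-6",
    "gemini-3-flash",
    "opencode-gemini-3-flash",
    "gpt5-4",
]
BENCHMARKS = ["fizzbuzz", "gcd_lcm", "balanced_parens"]
LANGS = ["ailang", "python"]

# One run per (model, benchmark, language) combination.
runs = list(product(MODELS, BENCHMARKS, LANGS))
print(len(runs))  # 5 * 3 * 2 = 30
```

If the dry run reports a different number, the model group or one of the comma-separated lists did not expand as expected.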

Troubleshooting

"non-agentic result: 0 turns, 0 tool calls"

The executor ran but the agent produced 0 tool calls — it either:

  • Printed an answer directly instead of writing a file (0-shot behavior)
  • Failed to auth (no key / expired token) and exited immediately
  • Used the wrong model string (google/ instead of google-vertex/ for opencode)

Run the verify command for that harness above and check the raw event output.

Codex: "openai: 401 Unauthorized"

OPENAI_API_KEY is not set or has expired. Confirm it is set with echo $OPENAI_API_KEY.

Gemini: binary not in PATH

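This applies only to the legacy gemini CLI executor, which was retired in v0.22.0 (see the note under Supported Harnesses). It is usually an NVM issue: the binary is installed into the Node bin directory of the active NVM version, so add that directory to PATH.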

opencode-gemini: "ProviderModelNotFoundError"

You're using google/... instead of google-vertex/.... Check agent_model_name in models.yml.