terminal-bench on RTX 3090 — runs

An iterative attempt to make local Qwen models (llama.cpp on a single RTX 3090, no external LLM providers) perform well on terminal-bench 2.0 by tuning the agent harness and the server config, one measured run at a time. Each run below links to a detail page with what was tested, the results, an interpretation and the next steps — read the story top to bottom to follow the iterations.

All runs

| run | model | thinking | trials | notes | mean |
|---|---|---|---|---|---|
| smoke__qwen3.6-35b-a3b__20260704-121319 | llama-local/qwen3.6-35b-a3b | on | 3 | | 0.33 |
| suite__qwen3.6-35b-a3b__20260703-003556 | llama-local/qwen3.6-35b-a3b | on | 89 | | 0.37 |
| smoke__qwen3.6-35b-a3b__20260702-213821 | llama-local/qwen3.6-35b-a3b | on | 20 | | 0.75 |
| smoke__qwen3.6-35b-a3b__20260702-191746 | llama-local/qwen3.6-35b-a3b | on | 30 | | 0.77 |
| smoke__qwen3.6-27b__20260702-190304 | llama-local/qwen3.6-27b | on | 4 | | 1.00 |
| smoke__qwen3.6-35b-a3b__20260702-181612 | llama-local/qwen3.6-35b-a3b | on | 10 | | 0.80 |
| smoke__qwen3.6-35b-a3b__20260702-180115 | llama-local/qwen3.6-35b-a3b | on | 1 | | 1.00 |
| cmpthink-on__qwen3.6-35b-a3b__20260702-172246 | llama-local/qwen3.6-35b-a3b | on | 4 | | 0.75 |
| cmpthink-off__qwen3.6-35b-a3b__20260702-171738 | llama-local/qwen3.6-35b-a3b | off | 4 | | 0.75 |
| smoke__qwen3.6-35b-a3b__20260702-164851 | llama-local/qwen3.6-35b-a3b | (none) | 4 | | 0.75 |
| smoke__qwen3.6-35b-a3b__20260702-163559 | llama-local/qwen3.6-35b-a3b | (none) | 4 | | 1.00 |

Analysis reports

The story so far

Chronological — each entry is one run's headline; click through for the full notes (what changed, results, interpretation, next steps).

Journal (mechanism & notes)

Runs journal

Human-readable notes on the benchmark runs under runs/. The per-trial JSON records the model, thinking, max_tokens, llama_port, rewards and tokens, but not which reasoning-budget mechanism was in effect. Recording that mechanism is the whole point of this file.

Per-run narrative lives in runs/<job>/NOTES.md (what was run & changed, results, interpretation, next steps) — rendered on each run's page of the Pages site, with the first # heading used as the run's timeline headline. This file stays the cross-run journal: mechanism history and how to read old runs.

How to read a run (things not in the logs)

⚠️ Reading runs from here on

Future runs use the server flag, so they will show llama_port=8020 and thinking=on — which looks identical in the logs to the old uncapped port-8020 runs (e.g. cmpthink-on). The difference (budget enforced server-side) is invisible to the trial JSON. Tell them apart by date (anything after the 2026-07-02 server restart has --reasoning-budget 4000) and by this journal. The budget value lives in the llama-swap config's git history, not in these logs — if you change it, add a row below noting the new value.
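As a rough illustration of telling runs apart by date, the sketch below parses the timestamp suffix of a job name and compares it against the first confirmed server-flag run from the run log. It assumes job names end in `__YYYYMMDD-HHMMSS` as in the "All runs" table; the helper names are hypothetical, and the cutoff is a conservative stand-in because the exact restart time is not recorded in this journal.

```python
from datetime import datetime

# Assumption: the first confirmed server-flag run marks the cutoff.
# The true server restart happened some time before this job started.
FIRST_SERVER_FLAG_JOB = "smoke__qwen3.6-35b-a3b__20260702-213821"


def job_timestamp(job: str) -> datetime:
    """Parse the trailing YYYYMMDD-HHMMSS stamp of a job name (assumed format)."""
    stamp = job.rsplit("__", 1)[-1]
    return datetime.strptime(stamp, "%Y%m%d-%H%M%S")


def budget_enforced_server_side(job: str) -> bool:
    """True if the job started at or after the first confirmed server-flag run.

    This only answers "was the server flag live?". Earlier proxy runs were
    also capped, but by the proxy, so they return False here.
    """
    return job_timestamp(job) >= job_timestamp(FIRST_SERVER_FLAG_JOB)
```

For example, `budget_enforced_server_side("cmpthink-on__qwen3.6-35b-a3b__20260702-172246")` is false even though that run shows the same port and thinking=on in its trial JSON. The journal remains the authority; this is only a convenience check.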

Run log (all qwen3.6-35b-a3b MoE unless noted)

| Job (runs/…) | Reasoning budget | thinking | tasks × K | result | notes |
|---|---|---|---|---|---|
| smoke__…163559 | none (uncontrolled) | (none) | 4 × 1 | 4/4 | first smoke; regex-log happened to terminate |
| smoke__…164851 | none (uncontrolled) | (none) | 4 × 1 | 3/4 | exposed the loop: regex-log burned 32k tok on reasoning, failed |
| cmpthink-off__…171738 | none | off | 4 × 1 | 3/4 | thinking OFF: regex-log passes, openssl fails |
| cmpthink-on__…172246 | none | on | 4 × 1 | 3/4 | thinking ON uncapped: regex-log loops/fails, openssl passes |
| smoke__…180115 | proxy 4000 | on | regex-log × 1 | 1/1 | proxy validation; cap fired 2× |
| smoke__…181612 | proxy 4000 | on | 2 × 5 (=10) | 0.80 | regex-log 5/5, openssl 3/5 |
| smoke__…191732 | proxy 4000 | on | regex-log × 1 | errored | single errored trial (ignore) |
| smoke__…191746 | proxy 4000 | on | 2 × 15 (=30) | 0.77 | regex-log 14/15 (93%), openssl 9/15 (60%) |
| smoke__qwen3.6-27b__190304 | proxy 4000 | on | 4 × 1 | 4/4 | dense 27B model through the proxy |
| smoke__…213821 | server flag 4000 | on | 2 × 10 (=20) | 0.75 | first confirmed server-flag run (--reasoning-budget 4000, port 8020). regex-log 8/10, openssl 7/10. Parity with proxy confirmed; see below |
| suite__…20260703-003556 | server flag 4000 | on | 89 × 1 | 0.371 | first full suite, ~23.5 h. See its NOTES.md + the two ANALYSIS-*.md reports |
| smoke__…20260704-121319 | server flag 4000 | on | 3 × 1 | 1/3 | first context-fix run (context_window=229376 + COMPACT_AT_TOKENS=200000 guard): guard active in every trial but never fired (peaks 51–165k), zero hard 400s; both former 400-victims now clean-stop near-misses (trajectory variance, not the fix). See its NOTES.md |

What the proxy-4000 runs showed (181612 + 191746)

Conclusion: 4000 fixes the loop without hurting the short task. It hasn't been tuned on breadth — a proper sweep (2000 / 4000 / 8000) should run against the full suite, since different tasks stress reasoning length differently. And these were proxy runs; re-verify parity now that the server flag is live.
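The per-task counts in the run log can be checked against the reported means with a few lines of arithmetic. This sketch uses only the pass counts listed for 181612 and 191746; the function names are illustrative.

```python
# (passed, trials) per task, copied from the run log rows for the two proxy runs.
runs = {
    "181612": {"regex-log": (5, 5), "openssl": (3, 5)},
    "191746": {"regex-log": (14, 15), "openssl": (9, 15)},
}


def run_mean(tasks):
    """Mean reward of one run: total passes over total trials."""
    passed = sum(p for p, _ in tasks.values())
    total = sum(n for _, n in tasks.values())
    return passed / total


def pooled(task):
    """Pass count and trial count for one task, pooled across both runs."""
    passed = sum(r[task][0] for r in runs.values())
    total = sum(r[task][1] for r in runs.values())
    return passed, total


for job, tasks in runs.items():
    print(job, f"{run_mean(tasks):.2f}")        # 0.80 and 0.77, matching the log
for task in ("regex-log", "openssl"):
    passed, total = pooled(task)
    print(task, f"{passed}/{total} = {passed / total:.0%}")
```

Pooled over both proxy runs, regex-log passes 19/20 (95%) and openssl 12/20 (60%). With only 20 trials per task, these rates carry wide uncertainty, which is one more reason to run the breadth sweep.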

Server-flag parity — verified on 213821

This is the first run on the native server budget (--reasoning-budget 4000 in the llama-swap config), replacing the proxy. Because a server-flag run is indistinguishable from an old uncapped port-8020 run in the trial JSON, parity was verified directly from the transcripts (agent/pi.txt):

Verdict: proxy → server flag is a clean swap — same bounding, same scores, no extra process. 4000 stays as the value pending a full-suite breadth sweep. To re-check on any future run: max thinking-block tokens should stay well under 4000 and stopReason: length count should be 0.
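A minimal sketch of that re-check follows. The layout of agent/pi.txt is not described here, so two things are assumptions: that a length stop appears in the text as `stopReason: length` (optionally JSON-quoted), and that the caller has already extracted per-block thinking token counts by whatever means fits the transcript. The function names are hypothetical.

```python
import re

# Assumption: matches both `stopReason: length` and `"stopReason": "length"`.
STOP_LENGTH = re.compile(r'"?stopReason"?\s*:\s*"?length\b')


def count_length_stops(transcript_text: str) -> int:
    """Count turns that ended because the token limit was hit."""
    return len(STOP_LENGTH.findall(transcript_text))


def parity_ok(thinking_block_tokens, transcript_text, budget=4000):
    """Check the two conditions: bounded thinking blocks and no length stops.

    `thinking_block_tokens` is an iterable of per-block token counts that the
    caller extracted from the transcript (extraction is not shown here).
    """
    max_thinking = max(thinking_block_tokens, default=0)
    return max_thinking < budget and count_length_stops(transcript_text) == 0
```

The check only enforces "under the budget"; whether the maximum is "well under" 4000 is still a judgment to make by looking at the numbers.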