An iterative attempt to make local Qwen models (llama.cpp on a single RTX 3090, no external LLM providers) perform well on terminal-bench 2.0 by tuning the agent harness and the server config, one measured run at a time. Each run below links to a detail page with what was tested, the results, an interpretation and the next steps — read the story top to bottom to follow the iterations.
| run | model | thinking | trials | notes | mean |
|---|---|---|---|---|---|
| smoke__qwen3.6-35b-a3b__20260704-121319 | llama-local/qwen3.6-35b-a3b | on | 3 | ✓ | 0.33 |
| suite__qwen3.6-35b-a3b__20260703-003556 | llama-local/qwen3.6-35b-a3b | on | 89 | ✓ | 0.37 |
| smoke__qwen3.6-35b-a3b__20260702-213821 | llama-local/qwen3.6-35b-a3b | on | 20 | ✓ | 0.75 |
| smoke__qwen3.6-35b-a3b__20260702-191746 | llama-local/qwen3.6-35b-a3b | on | 30 | ✓ | 0.77 |
| smoke__qwen3.6-27b__20260702-190304 | llama-local/qwen3.6-27b | on | 4 | ✓ | 1.00 |
| smoke__qwen3.6-35b-a3b__20260702-181612 | llama-local/qwen3.6-35b-a3b | on | 10 | ✓ | 0.80 |
| smoke__qwen3.6-35b-a3b__20260702-180115 | llama-local/qwen3.6-35b-a3b | on | 1 | ✓ | 1.00 |
| cmpthink-on__qwen3.6-35b-a3b__20260702-172246 | llama-local/qwen3.6-35b-a3b | on | 4 | ✓ | 0.75 |
| cmpthink-off__qwen3.6-35b-a3b__20260702-171738 | llama-local/qwen3.6-35b-a3b | off | 4 | ✓ | 0.75 |
| smoke__qwen3.6-35b-a3b__20260702-164851 | llama-local/qwen3.6-35b-a3b | (none) | 4 | ✓ | 0.75 |
| smoke__qwen3.6-35b-a3b__20260702-163559 | llama-local/qwen3.6-35b-a3b | (none) | 4 | ✓ | 1.00 |
Chronological — each entry is one run's headline; click through for the full notes (what changed, results, interpretation, next steps).
Human-readable notes on the benchmark runs under runs/. The per-trial JSON
records the model, thinking, max_tokens, llama_port, rewards and tokens —
but not which reasoning-budget mechanism was in effect. That's the whole
point of this file.
Per-run narrative lives in `runs/<job>/NOTES.md` (what was run & changed, results, interpretation, next steps), rendered on each run's page of the Pages site, with the first `# heading` used as the run's timeline headline. This file stays the cross-run journal: mechanism history and how to read old runs.
`llama_port` in the trial config tells you the path:

- `8020` = the agent talked directly to llama-swap.
- `8021` = the agent went through the reasoning-cap proxy (a since-removed host-side proxy that capped reasoning at ~4000 tok/turn and forced an answer). Any port-8021 run therefore had an effective reasoning budget of 4000.

`thinking = (none)` in the config means the run predates the thinking
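control (the early `reasoning: false` extension bug): reasoning was on but
uncontrolled, so it could loop and burn the whole budget.

Three budget mechanisms have been in effect, in order:

1. none: no reasoning cap (the uncontrolled and uncapped runs).
2. proxy: the since-removed reasoning-cap proxy, recognizable by `llama_port=8021`.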
3. server flag: llama.cpp native `--reasoning-budget 4000` set in the
llama-swap config (llama_cpp_on_my_desktop, committed 2026-07-02, server
restarted the same day). This is the current & future mechanism.

Future runs use the server flag, so they will show llama_port=8020 and
thinking=on — which looks identical in the logs to the old uncapped
port-8020 runs (e.g. cmpthink-on). The difference (budget enforced
server-side) is invisible to the trial JSON. Tell them apart by date
(anything after the 2026-07-02 server restart has --reasoning-budget 4000) and
by this journal. The budget value lives in the llama-swap config's git history,
not in these logs — if you change it, add a row below noting the new value.
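The rules above for telling mechanisms apart can be sketched as a small helper. This is an illustration only, not part of the harness: the function name, its arguments, and the returned labels are hypothetical, and it assumes the port, thinking setting, and run date have already been read from the trial config. Because the journal records only the day of the restart, a port-8020 run from 2026-07-02 itself cannot be classified from these fields and is reported as ambiguous.

```python
from datetime import date
from typing import Optional

# Day the llama-swap config was committed and the server restarted (from this journal).
# The budget value (4000) is hardcoded here; update it if the config changes.
SERVER_FLAG_DATE = date(2026, 7, 2)


def reasoning_mechanism(llama_port: int, thinking: Optional[str], run_date: date) -> str:
    """Best-effort label for the reasoning-budget mechanism behind one run."""
    if llama_port == 8021:
        return "proxy 4000"  # every port-8021 run went through the reasoning-cap proxy
    if llama_port != 8020:
        return "unknown port"
    if thinking in (None, "(none)"):
        return "none (uncontrolled)"  # predates the thinking control
    if thinking == "off":
        return "thinking off"
    if run_date > SERVER_FLAG_DATE:
        return "server flag 4000"
    if run_date < SERVER_FLAG_DATE:
        return "none (uncapped)"
    # Same day as the restart: the trial fields cannot distinguish the two cases.
    return "ambiguous: check this journal"
```

For example, `reasoning_mechanism(8020, "on", date(2026, 7, 3))` returns `"server flag 4000"`, while the same call with `date(2026, 7, 2)` returns the ambiguous label, which is exactly the case this journal exists to resolve.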
| Job (runs/…) | Reasoning budget | thinking | tasks × K | result | notes |
|---|---|---|---|---|---|
| smoke__…163559 | none (uncontrolled) | (none) | 4 × 1 | 4/4 | first smoke; regex-log happened to terminate |
| smoke__…164851 | none (uncontrolled) | (none) | 4 × 1 | 3/4 | exposed the loop: regex-log burned 32k tok on reasoning, failed |
| cmpthink-off__…171738 | none | off | 4 × 1 | 3/4 | thinking OFF: regex-log passes, openssl fails |
| cmpthink-on__…172246 | none | on | 4 × 1 | 3/4 | thinking ON uncapped: regex-log loops/fails, openssl passes |
| smoke__…180115 | proxy 4000 | on | regex-log × 1 | 1/1 | proxy validation; cap fired 2× |
| smoke__…181612 | proxy 4000 | on | 2 × 5 (=10) | 0.80 | regex-log 5/5, openssl 3/5 |
| smoke__…191732 | proxy 4000 | on | regex-log × 1 | errored | single errored trial (ignore) |
| smoke__…191746 | proxy 4000 | on | 2 × 15 (=30) | 0.77 | regex-log 14/15 (93%), openssl 9/15 (60%) |
| smoke__qwen3.6-27b__190304 | proxy 4000 | on | 4 × 1 | 4/4 | dense 27B model through the proxy |
| smoke__…213821 | server flag 4000 | on | 2 × 10 (=20) | 0.75 | first confirmed server-flag run (`--reasoning-budget 4000`, port 8020). regex-log 8/10, openssl 7/10. Parity with proxy confirmed; see below |
| suite__…20260703-003556 | server flag 4000 | on | 89 × 1 | 0.371 | first full suite, ~23.5 h. See its NOTES.md + the two ANALYSIS-*.md reports |
| smoke__…20260704-121319 | server flag 4000 | on | 3 × 1 | 1/3 | first context-fix run (context_window=229376 + COMPACT_AT_TOKENS=200000 guard): guard active in every trial but never fired (peaks 51–165k), zero hard 400s; both former 400-victims now clean-stop near-misses, which is trajectory variance, not the fix. See its NOTES.md |

- regex-log, which loops and fails uncapped, passed ~95% (19/20) with the cap. That jump is the evidence.
- openssl-selfsigned-cert ~60% is NOT a budget issue. Its entire trajectory is only ~1.7k–3.7k output tokens, so per-turn reasoning never reaches the 4000 cap and the budget never fires for it. 60% is the model's baseline on that task.

Conclusion: 4000 fixes the loop without hurting the short task. It hasn't been tuned on breadth: a proper sweep (2000 / 4000 / 8000) should run against the full suite, since different tasks stress reasoning length differently. And these were proxy runs; re-verify parity now that the server flag is live.
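
The ~95% and ~60% figures are pooled pass rates over the two multi-trial proxy runs in the table (…181612 and …191746). A minimal sketch of that arithmetic follows; the per-run counts are copied from the table's notes column, while the dictionary layout and the `pooled` helper are purely illustrative.

```python
# (passes, trials) per proxy-capped run, copied from the run table:
# first tuple is smoke__…181612, second is smoke__…191746.
PROXY_RUNS = {
    "regex-log": [(5, 5), (14, 15)],
    "openssl-selfsigned-cert": [(3, 5), (9, 15)],
}


def pooled(results):
    """Sum passes and trials across runs and return (passes, trials, rate)."""
    passes = sum(p for p, _ in results)
    trials = sum(n for _, n in results)
    return passes, trials, passes / trials


for task, results in PROXY_RUNS.items():
    passes, trials, rate = pooled(results)
    print(f"{task}: {passes}/{trials} = {rate:.0%}")
# prints:
# regex-log: 19/20 = 95%
# openssl-selfsigned-cert: 12/20 = 60%
```

With 20 trials per task these rates are coarse, which is one more reason the breadth sweep should run on the full suite.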
This is the first run on the native server budget (--reasoning-budget 4000
in the llama-swap config), replacing the proxy. Because a server-flag run is
indistinguishable from an old uncapped port-8020 run in the trial JSON, parity
was verified directly from the transcripts (agent/pi.txt):

- Thinking-block size per trial: max ≈3,141 tok (median ~146), vs ~20,000 tok in the uncapped runs (164851, cmpthink-on 172246) and ~4,001 tok under the proxy. No reasoning block gets near the 32k maxTokens ceiling. (The server tops out below 4000 because llama.cpp nudges the model to close `</think>` as it nears the budget, rather than the proxy's hard splice exactly at 4000.)
- Zero `stopReason: length` stops: every turn ended cleanly (toolUse/stop). The old uncapped 164851 had a length stop; that failure mode is gone.

Verdict: proxy → server flag is a clean swap: same bounding, same scores,
no extra process. 4000 stays as the value pending a full-suite breadth sweep.
To re-check on any future run: max thinking-block tokens should stay well under
4000 and stopReason: length count should be 0.
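
The re-check can be sketched as a small function. This is a sketch under stated assumptions: it does not parse agent/pi.txt (whose format is not described here) and instead expects each turn to have already been extracted into a mapping with two hypothetical keys, `thinking_tokens` and `stop_reason`. It also treats "well under 4000" as simply "below the budget", so a stricter margin is a matter of lowering `budget`.

```python
from typing import Iterable, Mapping


def check_run(turns: Iterable[Mapping], budget: int = 4000) -> dict:
    """Summarize one run: largest thinking block and number of length stops.

    Each turn is assumed to carry 'thinking_tokens' (int) and 'stop_reason'
    (str); both key names are hypothetical, not the transcript's own.
    """
    max_thinking = 0
    length_stops = 0
    for turn in turns:
        max_thinking = max(max_thinking, turn.get("thinking_tokens", 0))
        if turn.get("stop_reason") == "length":
            length_stops += 1
    return {
        "max_thinking_tokens": max_thinking,
        "length_stops": length_stops,
        "ok": max_thinking < budget and length_stops == 0,
    }


# Illustrative turns only, shaped like the figures quoted above.
capped = [
    {"thinking_tokens": 146, "stop_reason": "toolUse"},
    {"thinking_tokens": 3141, "stop_reason": "stop"},
]
uncapped = [{"thinking_tokens": 20000, "stop_reason": "length"}]

print(check_run(capped)["ok"])    # True
print(check_run(uncapped)["ok"])  # False
```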