terminal-bench on RTX 3090 — runs

An iterative attempt to make local Qwen models (llama.cpp on a single RTX 3090, no external LLM providers) perform well on terminal-bench 2.0 by tuning the agent harness and the server config, one measured run at a time. Each run below links to a detail page with what was tested, the results, an interpretation and the next steps — read the story top to bottom to follow the iterations.

All runs

run	model	thinking	trials	notes	mean
smoke__qwen3.6-35b-a3b__20260704-121319	llama-local/qwen3.6-35b-a3b	on	3	✓	0.33
suite__qwen3.6-35b-a3b__20260703-003556	llama-local/qwen3.6-35b-a3b	on	89	✓	0.37
smoke__qwen3.6-35b-a3b__20260702-213821	llama-local/qwen3.6-35b-a3b	on	20	✓	0.75
smoke__qwen3.6-35b-a3b__20260702-191746	llama-local/qwen3.6-35b-a3b	on	30	✓	0.77
smoke__qwen3.6-27b__20260702-190304	llama-local/qwen3.6-27b	on	4	✓	1.00
smoke__qwen3.6-35b-a3b__20260702-181612	llama-local/qwen3.6-35b-a3b	on	10	✓	0.80
smoke__qwen3.6-35b-a3b__20260702-180115	llama-local/qwen3.6-35b-a3b	on	1	✓	1.00
cmpthink-on__qwen3.6-35b-a3b__20260702-172246	llama-local/qwen3.6-35b-a3b	on	4	✓	0.75
cmpthink-off__qwen3.6-35b-a3b__20260702-171738	llama-local/qwen3.6-35b-a3b	off	4	✓	0.75
smoke__qwen3.6-35b-a3b__20260702-164851	llama-local/qwen3.6-35b-a3b	(none)	4	✓	0.75
smoke__qwen3.6-35b-a3b__20260702-163559	llama-local/qwen3.6-35b-a3b	(none)	4	✓	1.00

Analysis reports

Full-suite analysis — qwen3.6-35b-a3b MoE on terminal-bench 2.0 (89/89 complete)
Failure-mode analysis — qwen3.6-35b-a3b MoE on terminal-bench 2.0 (partial suite)

The story so far

Chronological — each entry is one run's headline; click through for the full notes (what changed, results, interpretation, next steps).

2026-07-02 16:36 1.00 smoke__qwen3.6-35b-a3b__20260702-163559
First smoke test — the pipeline works end to end (4/4, but luck was involved)
2026-07-02 16:48 0.75 smoke__qwen3.6-35b-a3b__20260702-164851
Re-run exposes the runaway-reasoning loop (3/4) — the first real failure mode
2026-07-02 17:17 0.75 cmpthink-off__qwen3.6-35b-a3b__20260702-171738
A/B part 1 — thinking OFF: no loops, but the harder task regresses (3/4)
2026-07-02 17:22 0.75 cmpthink-on__qwen3.6-35b-a3b__20260702-172246
A/B part 2 — thinking ON (uncapped): the mirror image (3/4)
2026-07-02 18:01 1.00 smoke__qwen3.6-35b-a3b__20260702-180115
Reasoning-cap proxy validation — the cap fires and the looping task passes (1/1)
2026-07-02 18:16 0.80 smoke__qwen3.6-35b-a3b__20260702-181612
Proxy cap, K=5 stats — regex-log 5/5; openssl's misses are not budget-related (0.80)
2026-07-02 19:03 1.00 smoke__qwen3.6-27b__20260702-190304
Dense 27B sanity check through the proxy — 4/4
2026-07-02 19:17 0.77 smoke__qwen3.6-35b-a3b__20260702-191746
Proxy cap at K=15 — the evidence run: regex-log 93%, openssl is model baseline (0.77)
2026-07-02 21:38 0.75 smoke__qwen3.6-35b-a3b__20260702-213821
Proxy → native server flag: parity confirmed, proxy retired (0.75)
2026-07-03 00:35 0.37 suite__qwen3.6-35b-a3b__20260703-003556
Full 89-task suite (complete) — 0.371 final; a ~50k context cliff dominates, runaways are oversized writes not text loops
2026-07-04 12:13 0.33 smoke__qwen3.6-35b-a3b__20260704-121319
Context-fix smoke — guard installs cleanly, zero 400s, but it never fired; both former 400-victims now fail as near-misses instead of context deaths

Journal (mechanism & notes)

Runs journal

Human-readable notes on the benchmark runs under runs/. The per-trial JSON records the model, thinking, max_tokens, llama_port, rewards and tokens, but not which reasoning-budget mechanism was in effect. Recording that mechanism is the whole point of this file.

Per-run narrative lives in runs/<job>/NOTES.md (what was run & changed, results, interpretation, next steps) — rendered on each run's page of the Pages site, with the first # heading used as the run's timeline headline. This file stays the cross-run journal: mechanism history and how to read old runs.

How to read a run (things not in the logs)

llama_port in the trial config tells you the path:
8020 = the agent talked directly to llama-swap.
8021 = the agent went through the reasoning-cap proxy (a since-removed host-side proxy that capped reasoning at ~4000 tok/turn and forced an answer). Any port-8021 run therefore had an effective reasoning budget of 4000.
thinking = (none) in the config means the run predates the thinking control (the early reasoning: false extension bug): reasoning was on but uncontrolled, so it could loop and burn the whole budget.
The reasoning-budget mechanism evolved in three phases (see table):
1. none: thinking uncontrolled, no cap (early runs; regex-log loops).
2. proxy: external reasoning-cap proxy, cap 4000, llama_port=8021.
3. server flag: llama.cpp native --reasoning-budget 4000 set in the llama-swap config (llama_cpp_on_my_desktop, committed 2026-07-02, server restarted the same day). This is the current & future mechanism.

⚠️ Reading runs from here on

Future runs use the server flag, so they will show llama_port=8020 and thinking=on — which looks identical in the logs to the old uncapped port-8020 runs (e.g. cmpthink-on). The difference (budget enforced server-side) is invisible to the trial JSON. Tell them apart by date (anything after the 2026-07-02 server restart has --reasoning-budget 4000) and by this journal. The budget value lives in the llama-swap config's git history, not in these logs — if you change it, add a row below noting the new value.
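The reading rules above can be sketched as a small helper. This is a minimal sketch, not a tool that ships with the repo: the function names are hypothetical, and it assumes the job directory name ends in a YYYYMMDD-HHMMSS stamp and that a missing thinking setting shows up as None or "(none)". The journal records only the date of the server restart, not the time, so port-8020 runs from 2026-07-02 itself are reported as ambiguous rather than guessed.

```python
from datetime import date, datetime
from typing import Optional

# Day the llama-swap server was restarted with --reasoning-budget 4000 (from this journal).
SERVER_FLAG_DATE = date(2026, 7, 2)


def job_timestamp(job: str) -> datetime:
    """Parse the trailing YYYYMMDD-HHMMSS stamp of a job directory name."""
    return datetime.strptime(job.rsplit("__", 1)[-1], "%Y%m%d-%H%M%S")


def budget_mechanism(job: str, llama_port: int, thinking: Optional[str]) -> str:
    """Best-effort label for the reasoning-budget mechanism behind one run."""
    if llama_port == 8021:
        return "proxy 4000"  # every port-8021 run went through the proxy
    if llama_port != 8020:
        return "unknown port: check this journal"
    if thinking in (None, "(none)"):
        return "none (uncontrolled)"  # predates the thinking control
    if thinking == "off":
        return "none (thinking off)"
    run_day = job_timestamp(job).date()
    if run_day > SERVER_FLAG_DATE:
        return "server flag 4000"  # unless a later journal row notes a new value
    if run_day < SERVER_FLAG_DATE:
        return "none (uncapped)"
    return "ambiguous: run on the restart day, check this journal"
```

For example, `budget_mechanism("suite__qwen3.6-35b-a3b__20260703-003556", 8020, "on")` yields the server-flag label, while cmpthink-on (same port, same thinking value, run on the restart day) comes back as ambiguous and has to be resolved from the run log.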

Run log (all qwen3.6-35b-a3b MoE unless noted)

Job (runs/…)	Reasoning budget	thinking	tasks × K	result	notes
`smoke__…163559`	none (uncontrolled)	(none)	4 × 1	4/4	first smoke; regex-log happened to terminate
`smoke__…164851`	none (uncontrolled)	(none)	4 × 1	3/4	exposed the loop: regex-log burned 32k tok on reasoning, failed
`cmpthink-off__…171738`	none	off	4 × 1	3/4	thinking OFF: regex-log passes, openssl fails
`cmpthink-on__…172246`	none	on	4 × 1	3/4	thinking ON uncapped: regex-log loops/fails, openssl passes
`smoke__…180115`	proxy 4000	on	regex-log × 1	1/1	proxy validation; cap fired 2×
`smoke__…181612`	proxy 4000	on	2 × 5 (=10)	0.80	regex-log 5/5, openssl 3/5
`smoke__…191732`	proxy 4000	on	regex-log × 1	errored	single errored trial (ignore)
`smoke__…191746`	proxy 4000	on	2 × 15 (=30)	0.77	regex-log 14/15 (93%), openssl 9/15 (60%)
`smoke__qwen3.6-27b__190304`	proxy 4000	on	4 × 1	4/4	dense 27B model through the proxy
`smoke__…213821`	server flag 4000	on	2 × 10 (=20)	0.75	first confirmed server-flag run (`--reasoning-budget 4000`, port 8020). regex-log 8/10, openssl 7/10. Parity with proxy confirmed — see below
`suite__…20260703-003556`	server flag 4000	on	89 × 1	0.371	first full suite, ~23.5 h. See its NOTES.md + the two ANALYSIS-*.md reports
`smoke__…20260704-121319`	server flag 4000	on	3 × 1	1/3	first context-fix run (`context_window=229376` + `COMPACT_AT_TOKENS=200000` guard): guard active in every trial but never fired (peaks 51–165k), zero hard 400s; both former 400-victims now clean-stop near-misses — trajectory variance, not the fix. See its NOTES.md
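Why the context guard never fired in the last row can be illustrated with a few lines. This is only a sketch under assumptions: the real guard's logic is not shown in this journal, so `guard_would_fire` is a hypothetical stand-in that compares a context size against the threshold, and the reported peaks are assumed to be context sizes in tokens.

```python
COMPACT_AT_TOKENS = 200_000  # guard threshold quoted in the run log
CONTEXT_WINDOW = 229_376     # context_window quoted in the run log


def guard_would_fire(peak_context_tokens: int, threshold: int = COMPACT_AT_TOKENS) -> bool:
    """Hypothetical guard rule: compact once the context reaches the threshold."""
    return peak_context_tokens >= threshold


# Reported peaks ranged from about 51k to 165k tokens across the three trials.
for peak in (51_000, 165_000):
    headroom = COMPACT_AT_TOKENS - peak
    print(f"peak={peak} fires={guard_would_fire(peak)} headroom_to_guard={headroom}")
```

Under that assumption even the largest trial stayed well below the threshold, which matches the note that the run says nothing yet about how the guard behaves once it triggers.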

What the proxy-4000 runs showed (181612 + 191746)

The budget works. regex-log — which loops and fails uncapped — passed ~95% (19/20) with the cap. That jump is the evidence.
openssl-selfsigned-cert ~60% is NOT a budget issue. Its entire trajectory is only ~1.7k–3.7k output tokens, so per-turn reasoning never reaches the 4000 cap — the budget never fires for it. 60% is the model's baseline on that task.
Token totals are per-trajectory (all turns), not per-turn. regex-log's high totals (up to ~50k) are ~10 agent turns, each with reasoning capped at ~4000 — not a single runaway. Judge the cap by pass rate, not token totals.
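The pooled percentages quoted above come from adding the two proxy runs together. A minimal sketch of that arithmetic, with the per-task counts copied from the run log (the dictionary layout and the `pooled` helper are illustrative, not a format the runs use):

```python
# Per-task (passes, trials) for the two proxy-4000 runs, copied from the run log.
proxy_runs = {
    "181612": {"regex-log": (5, 5), "openssl-selfsigned-cert": (3, 5)},
    "191746": {"regex-log": (14, 15), "openssl-selfsigned-cert": (9, 15)},
}


def pooled(task: str) -> tuple[int, int]:
    """Sum passes and trials for one task across both runs."""
    passes = sum(run[task][0] for run in proxy_runs.values())
    trials = sum(run[task][1] for run in proxy_runs.values())
    return passes, trials


for task in ("regex-log", "openssl-selfsigned-cert"):
    passes, trials = pooled(task)
    print(f"{task}: {passes}/{trials} = {passes / trials:.0%}")
```

This prints 19/20 = 95% for regex-log and 12/20 = 60% for openssl-selfsigned-cert.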

Conclusion: 4000 fixes the loop without hurting the short task. It hasn't been tuned on breadth — a proper sweep (2000 / 4000 / 8000) should run against the full suite, since different tasks stress reasoning length differently. And these were proxy runs; re-verify parity now that the server flag is live.

Server-flag parity — verified on 213821

This is the first run on the native server budget (--reasoning-budget 4000 in the llama-swap config), replacing the proxy. Because a server-flag run is indistinguishable from an old uncapped port-8020 run in the trial JSON, parity was verified directly from the transcripts (agent/pi.txt):

The cap engages. Largest single thinking block per trial: max ≈3,141 tok (median ~146), vs ~20,000 tok in the uncapped runs (164851, cmpthink-on 172246) and ~4,001 tok under the proxy. No reasoning block gets near the 32k maxTokens ceiling. (The server tops out below 4000 because llama.cpp nudges the model to close </think> as it nears the budget, rather than the proxy's hard splice exactly at 4000.)
Zero runaway terminations. Across all 20 trials / 278 assistant messages, 0 stopReason: length — every turn ended cleanly (toolUse/stop). The old uncapped 164851 had a length stop; that failure mode is gone.
No quality regression. Mean 0.75 (20 trials) vs proxy's 0.77 (30 trials) and the earlier ~0.75: a gap of about two points, well within sampling noise at these trial counts.
Remaining failures are not budget-related. The regex-log FAILs loop at the agent level (many tool-call turns, e.g. one burned 61.8k output tokens across turns while its largest think block was only ~3,141 tok), and openssl's 70% is the model's baseline (its short trajectories never reach the cap).
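To put a number on how little 20 and 30 trials can distinguish, the sketch below computes approximate 95% Wilson score intervals for the two means. It is a rough illustration only: it pools trials from two tasks with different pass rates as if they were one binomial sample, which the journal itself does not claim.

```python
from math import sqrt


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate."""
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


server_flag = wilson_interval(15, 20)  # 213821: regex-log 8/10 + openssl 7/10
proxy = wilson_interval(23, 30)        # 191746: regex-log 14/15 + openssl 9/15

print(f"server flag: {server_flag[0]:.2f} to {server_flag[1]:.2f}")
print(f"proxy:       {proxy[0]:.2f} to {proxy[1]:.2f}")
```

Each interval is roughly 30 to 35 points wide and the two overlap almost entirely, so these runs can rule out a large regression but not a small one.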

Verdict: proxy → server flag is a clean swap — same bounding, same scores, no extra process. 4000 stays as the value pending a full-suite breadth sweep. To re-check on any future run: max thinking-block tokens should stay well under 4000 and stopReason: length count should be 0.
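The re-check can be sketched as a function over already-parsed assistant messages. Parsing agent/pi.txt is deliberately left out because its format is not described here; the `thinking_tokens` key and the message layout are hypothetical, while `stopReason` and its `length` value are the names used in the transcripts. "Well under 4000" is approximated as strictly below the budget, which is an assumption you may want to tighten.

```python
from typing import Iterable, Mapping

REASONING_BUDGET = 4000  # current --reasoning-budget value from this journal


def recheck(messages: Iterable[Mapping]) -> dict:
    """Summarise cap health from already-parsed assistant messages.

    Each message is assumed (hypothetically) to look like
    {"stopReason": "toolUse", "thinking_tokens": 146}.
    """
    msgs = list(messages)
    # Largest single thinking block across all messages (0 if there are none).
    max_think = max((m.get("thinking_tokens", 0) for m in msgs), default=0)
    # Turns that ran into the max-token ceiling instead of ending cleanly.
    length_stops = sum(1 for m in msgs if m.get("stopReason") == "length")
    return {
        "max_thinking_tokens": max_think,
        "length_stops": length_stops,
        "ok": max_think < REASONING_BUDGET and length_stops == 0,
    }
```

A run shaped like 213821 (largest block about 3,141 tokens, no length stops) would report ok; a single uncapped turn of about 20,000 thinking tokens ending in a length stop would not.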