Proxy → native server flag: parity confirmed, proxy retired (0.75)

What was run

First run on the native mechanism: llama.cpp's --reasoning-budget 4000 set in the llama-swap config (server restarted 2026-07-02), replacing the external proxy. Same two signal tasks × 10 attempts (20 trials), thinking ON, port 8020 (direct — from here on, direct runs are budget-capped server-side; the trial JSON can't tell them apart from old uncapped runs, see RUNS.md).

Results

Mean 0.75: regex-log 8/10, openssl-selfsigned-cert 7/10. Verified from the transcripts: - Largest single <think> block ≈ 3,141 tokens (median ~146) vs ~20,000 in the uncapped runs — the cap engages, and the server nudges </think> closed before 4000 rather than hard-splicing at it like the proxy did. - 0 × stopReason: length across all 278 assistant messages — the runaway-turn failure mode is gone at this task scale.

Interpretation

Clean swap: same bounding, statistically identical score (0.75 vs proxy's 0.77), one less process. The server flag is the mechanism going forward. Remaining failures are agent-level (many-turn meandering) or model baseline — not reasoning-budget issues.

Next steps

Stop tuning on 2 tasks — run the full 89-task suite to get breadth, find the failure modes the smoke set can't show, and collect data for the reasoning-budget sweep.

modelllama-local/qwen3.6-35b-a3bagentharnesses.minimal_pi:MinimalPithinkingonreasoning budgetdirect (server/none)* — see journal for the authoritative mechanismmaxTokens / contextWindow32768 / 131072agent timeout ×2.0trials20 of 20 — 15 pass · 5 failmean reward0.75tokens (job total)3,375,951 in / 303,504 outstarted / finished2026-07-02T21:38 / 2026-07-02T22:28wall clock50m02s

#	result	total	agent	in/out tok
1	FAIL	1m13s	22s	56946/3341
2	PASS	1m01s	12s	31901/1848
3	PASS	1m05s	16s	51366/2612
4	PASS	1m07s	18s	49881/2959
5	PASS	1m03s	11s	21314/1742
6	PASS	1m02s	11s	28437/1826
7	FAIL	1m02s	13s	30972/1686
8	PASS	1m03s	12s	26288/1939
9	PASS	1m11s	17s	37678/2903
10	FAIL	1m06s	14s	27558/1941

#	result	total	agent	in/out tok
1	PASS	2m04s	1m08s	90522/11656
2	FAIL	7m50s	6m52s	670814/61801
3	PASS	3m21s	2m23s	206399/23252
4	FAIL	3m30s	2m32s	213304/25445
5	PASS	3m45s	2m42s	264809/26357
6	PASS	3m40s	2m41s	210972/26023
7	PASS	6m06s	5m08s	713288/49380
8	PASS	3m06s	2m06s	206969/21070
9	PASS	3m02s	2m00s	235790/19868
10	PASS	2m36s	1m35s	200743/15855

smokeqwen3.6-35b-a3b20260702-213821

Proxy → native server flag: parity confirmed, proxy retired (0.75)

What was run

Results

Interpretation

Next steps

Run details

Tasks

openssl-selfsigned-cert — 7/10 passed

regex-log — 8/10 passed

smoke__qwen3.6-35b-a3b__20260702-213821

Proxy → native server flag: parity confirmed, proxy retired (0.75)

What was run

Results

Interpretation

Next steps

Run details

Tasks

openssl-selfsigned-cert — 7/10 passed

regex-log — 8/10 passed

smokeqwen3.6-35b-a3b20260702-213821