← older: smoke__qwen3.6-35b-a3b__20260702-191746all runsnewer: suite__qwen3.6-35b-a3b__20260703-003556

smoke__qwen3.6-35b-a3b__20260702-213821

Proxy → native server flag: parity confirmed, proxy retired (0.75)

What was run

First run on the native mechanism: llama.cpp's --reasoning-budget 4000 set in the llama-swap config (server restarted 2026-07-02), replacing the external proxy. Same two signal tasks × 10 attempts (20 trials), thinking ON, port 8020 (direct — from here on, direct runs are budget-capped server-side; the trial JSON can't tell them apart from old uncapped runs, see RUNS.md).

Results

Mean 0.75: regex-log 8/10, openssl-selfsigned-cert 7/10. Verified from the transcripts: - Largest single <think> block ≈ 3,141 tokens (median ~146) vs ~20,000 in the uncapped runs — the cap engages, and the server nudges </think> closed before 4000 rather than hard-splicing at it like the proxy did. - 0 × stopReason: length across all 278 assistant messages — the runaway-turn failure mode is gone at this task scale.

Interpretation

Clean swap: same bounding, statistically identical score (0.75 vs proxy's 0.77), one less process. The server flag is the mechanism going forward. Remaining failures are agent-level (many-turn meandering) or model baseline — not reasoning-budget issues.

Next steps

Stop tuning on 2 tasks — run the full 89-task suite to get breadth, find the failure modes the smoke set can't show, and collect data for the reasoning-budget sweep.

Run details

modelllama-local/qwen3.6-35b-a3bagentharnesses.minimal_pi:MinimalPithinkingonreasoning budgetdirect (server/none)* — see journal for the authoritative mechanismmaxTokens / contextWindow32768 / 131072agent timeout ×2.0trials20 of 20 — 15 pass · 5 failmean reward0.75tokens (job total)3,375,951 in / 303,504 outstarted / finished2026-07-02T21:38 / 2026-07-02T22:28wall clock50m02s

Tasks

openssl-selfsigned-cert — 7/10 passed

#resulttotalagentin/out tok
1FAIL1m13s22s56946/3341
2PASS1m01s12s31901/1848
3PASS1m05s16s51366/2612
4PASS1m07s18s49881/2959
5PASS1m03s11s21314/1742
6PASS1m02s11s28437/1826
7FAIL1m02s13s30972/1686
8PASS1m03s12s26288/1939
9PASS1m11s17s37678/2903
10FAIL1m06s14s27558/1941

regex-log — 8/10 passed

#resulttotalagentin/out tok
1PASS2m04s1m08s90522/11656
2FAIL7m50s6m52s670814/61801
3PASS3m21s2m23s206399/23252
4FAIL3m30s2m32s213304/25445
5PASS3m45s2m42s264809/26357
6PASS3m40s2m41s210972/26023
7PASS6m06s5m08s713288/49380
8PASS3m06s2m06s206969/21070
9PASS3m02s2m00s235790/19868
10PASS2m36s1m35s200743/15855