First run on the native mechanism: llama.cpp's --reasoning-budget 4000,
set in the llama-swap config (server restarted 2026-07-02), replaces the
external proxy. Same two signal tasks × 10 attempts (20 trials), thinking ON,
port 8020 (direct). From here on, direct runs are budget-capped server-side;
the trial JSON can't tell them apart from old uncapped runs (see RUNS.md).
Mean 0.75: regex-log 8/10, openssl-selfsigned-cert 7/10. Verified from
the transcripts:
- Largest single <think> block ≈ 3,141 tokens (median ~146) vs ~20,000 in
the uncapped runs — the cap engages, and the server nudges </think> closed
before 4000 rather than hard-splicing at it like the proxy did.
- 0 × stopReason: length across all 278 assistant messages — the
runaway-turn failure mode is gone at this task scale.
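
The two transcript checks above can be sketched roughly as follows. This is a minimal sketch, not the script used for this entry: it assumes each transcript is a list of message dicts with hypothetical `role` and `content` keys plus the `stopReason` field named above, that thinking appears inline between literal `<think>` tags, and it estimates tokens at about 4 characters each instead of using the model's tokenizer.

```python
import re
from statistics import median

# Matches a <think> block; an unterminated block runs to the end of the message.
THINK_RE = re.compile(r"<think>(.*?)(?:</think>|\Z)", re.DOTALL)


def approx_tokens(text):
    # ASSUMPTION: ~4 characters per token. Real counts need the model tokenizer.
    return len(text) // 4


def think_block_stats(messages):
    """Summarise think-block sizes and length stops over assistant messages."""
    sizes = []
    assistant_messages = 0
    length_stops = 0
    for msg in messages:
        # 'role' and 'content' are hypothetical key names for illustration.
        if msg.get("role") != "assistant":
            continue
        assistant_messages += 1
        if msg.get("stopReason") == "length":
            length_stops += 1
        for block in THINK_RE.findall(msg.get("content", "")):
            sizes.append(approx_tokens(block))
    return {
        "assistant_messages": assistant_messages,
        "length_stops": length_stops,
        "max_think_tokens": max(sizes, default=0),
        "median_think_tokens": median(sizes) if sizes else 0,
    }
```

Run over every transcript in the job, `max_think_tokens` staying under the 4000 budget and `length_stops == 0` would correspond to the two bullets; the character-based estimate is only good enough to show the order of magnitude.
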
Clean swap: same bounding, statistically indistinguishable score (0.75 vs the proxy's 0.77; at 20 trials a single trial moves the mean by 0.05), one fewer process. The server flag is the mechanism going forward. Remaining failures are agent-level (many-turn meandering) or model baseline, not reasoning-budget issues.
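
A back-of-envelope check on the 0.75 vs 0.77 comparison, as a sketch: the Wilson score interval is my choice here, not something the harness reports, and it treats the 20 trials as independent coin flips even though they come from only two tasks.

```python
import math


def wilson_interval(passes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate."""
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# 15 passes out of 20 trials in this run.
low, high = wilson_interval(15, 20)
print(f"95% interval for 15/20: {low:.2f} to {high:.2f}")
```

The interval comes out at roughly 0.53 to 0.89, so a 0.02 gap between the two mechanisms is far inside the noise at this sample size.
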
Stop tuning on 2 tasks — run the full 89-task suite to get breadth, find the failure modes the smoke set can't show, and collect data for the reasoning-budget sweep.

llama-local/qwen3.6-35b-a3b

| field | value |
|---|---|
| agent | harnesses.minimal_pi:MinimalPi |
| thinking | on |
| reasoning budget | direct (server/none)*; see journal for the authoritative mechanism |
| maxTokens / contextWindow | 32768 / 131072 |
| agent timeout × | 2.0 |
| trials | 20 of 20: 15 pass · 5 fail |
| mean reward | 0.75 |
| tokens (job total) | 3,375,951 in / 303,504 out |
| started / finished | 2026-07-02T21:38 / 2026-07-02T22:28 |
| wall clock | 50m02s |

openssl-selfsigned-cert (7/10 pass):

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | FAIL | 1m13s | 22s | 56946/3341 |
| 2 | PASS | 1m01s | 12s | 31901/1848 |
| 3 | PASS | 1m05s | 16s | 51366/2612 |
| 4 | PASS | 1m07s | 18s | 49881/2959 |
| 5 | PASS | 1m03s | 11s | 21314/1742 |
| 6 | PASS | 1m02s | 11s | 28437/1826 |
| 7 | FAIL | 1m02s | 13s | 30972/1686 |
| 8 | PASS | 1m03s | 12s | 26288/1939 |
| 9 | PASS | 1m11s | 17s | 37678/2903 |
| 10 | FAIL | 1m06s | 14s | 27558/1941 |

regex-log (8/10 pass):

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 2m04s | 1m08s | 90522/11656 |
| 2 | FAIL | 7m50s | 6m52s | 670814/61801 |
| 3 | PASS | 3m21s | 2m23s | 206399/23252 |
| 4 | FAIL | 3m30s | 2m32s | 213304/25445 |
| 5 | PASS | 3m45s | 2m42s | 264809/26357 |
| 6 | PASS | 3m40s | 2m41s | 210972/26023 |
| 7 | PASS | 6m06s | 5m08s | 713288/49380 |
| 8 | PASS | 3m06s | 2m06s | 206969/21070 |
| 9 | PASS | 3m02s | 2m00s | 235790/19868 |
| 10 | PASS | 2m36s | 1m35s | 200743/15855 |
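
The per-trial token columns can be cross-checked against the job total with a few lines of Python. This is a small sketch: the `sum_tokens` helper and the list names are illustrative, and the values are copied from the in/out tok columns of the two tables above.

```python
def sum_tokens(cells):
    """Sum 'in/out' token cells such as '56946/3341' into (total_in, total_out)."""
    total_in = total_out = 0
    for cell in cells:
        tok_in, tok_out = (int(part) for part in cell.split("/"))
        total_in += tok_in
        total_out += tok_out
    return total_in, total_out


# in/out tok column of the first per-trial table.
FIRST_TABLE = [
    "56946/3341", "31901/1848", "51366/2612", "49881/2959", "21314/1742",
    "28437/1826", "30972/1686", "26288/1939", "37678/2903", "27558/1941",
]
# in/out tok column of the second per-trial table.
SECOND_TABLE = [
    "90522/11656", "670814/61801", "206399/23252", "213304/25445",
    "264809/26357", "210972/26023", "713288/49380", "206969/21070",
    "235790/19868", "200743/15855",
]

print(sum_tokens(FIRST_TABLE))
print(sum_tokens(SECOND_TABLE))
print(sum_tokens(FIRST_TABLE + SECOND_TABLE))
```

The combined sum matches the reported job total of 3,375,951 in / 303,504 out, and the second table alone accounts for roughly 89% of the input tokens.
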