Other half of the thinking comparison: same 4 smoke tasks, qwen3.6-35b-a3b,
--thinking on, still no reasoning cap of any kind.
3/4, and exactly the mirror of cmpthink-off: openssl-selfsigned-cert
passes (thinking helps it), while regex-log loops in reasoning and
fails again.
The A/B is conclusive: thinking has real per-task value, and the loop is reproducible — it's a property of uncapped reasoning on this model, not noise. The requirement is now precise: keep thinking on, cap the per-turn reasoning length, force an answer at the cap.
Implement a reasoning cap. First mechanism: an external host-side proxy (port 8021) that splices in the end-of-thinking tag at ~4000 tokens — validated in the next run.
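The cap-and-force behaviour described above can be sketched as a small pure function. This is an illustration only, not the proxy's actual code: the name `cap_reasoning`, the `</think>` tag, and the rule that one streamed chunk counts as one token are all assumptions, and the sketch also assumes the closing tag arrives as a single chunk.

```python
from typing import Iterable, List, Tuple

END_THINK = "</think>"   # assumed end-of-thinking tag; the journal does not name it
REASONING_CAP = 4000     # approximate cap from the text, counted in streamed chunks


def cap_reasoning(chunks: Iterable[str], cap: int = REASONING_CAP,
                  end_tag: str = END_THINK) -> Tuple[List[str], bool]:
    """Collect reasoning chunks until the model closes its thinking block
    or `cap` chunks have been kept, whichever comes first.

    Returns (kept_chunks, capped). When capped is True, the last element is
    an end tag spliced in by this function rather than produced by the model.
    """
    kept: List[str] = []
    for chunk in chunks:
        if end_tag in chunk:
            # The model finished thinking on its own: nothing to splice.
            kept.append(chunk)
            return kept, False
        if len(kept) >= cap:
            # Budget exhausted: close the thinking block ourselves.
            kept.append(end_tag)
            return kept, True
        kept.append(chunk)
    # Stream ended before either the closing tag or the cap was reached.
    return kept, False


if __name__ == "__main__":
    looping = ["again "] * 10            # stand-in for a reasoning loop
    kept, capped = cap_reasoning(looping, cap=3)
    print(kept, capped)
```

In a proxy, the caller would presumably then resume generation from a prefix that already ends with the tag, so the model continues straight into its answer. That resume step depends on the server API and is not shown here.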

| field | value |
|---|---|
| model | llama-local/qwen3.6-35b-a3b |
| agent | harnesses.minimal_pi:MinimalPi |
| thinking | on |
| reasoning budget | direct (server/none)* |
| maxTokens / contextWindow | 32768 / 131072 |
| agent timeout × | 2.0 |
| trials | 4 of 4: 3 pass · 1 fail |
| mean reward | 0.75 |
| tokens (job total) | 80,679 in / 36,833 out |
| started / finished | 2026-07-02T17:22 / 2026-07-02T17:29 |
| wall clock | 7m05s |

\* see journal for the authoritative mechanism

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 57s | 7s | 17008/1033 |

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 1m02s | 13s | 35684/1887 |

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 1m01s | 12s | 26230/1913 |

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | FAIL | 4m04s | 3m05s | 1757/32000 |
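
As a consistency check, the per-task rows above can be summed to reproduce the job-level figures. The sketch below assumes a reward of 1 for PASS and 0 for FAIL, which is consistent with the reported mean of 0.75 but is not stated in the summary; the variable names are illustrative.

```python
# (input tokens, output tokens, passed) for each task, copied from the tables above
rows = [
    (17008, 1033, True),
    (35684, 1887, True),
    (26230, 1913, True),
    (1757, 32000, False),
]

total_in = sum(tokens_in for tokens_in, _, _ in rows)
total_out = sum(tokens_out for _, tokens_out, _ in rows)
# Assumption: reward is 1 for a pass and 0 for a fail.
mean_reward = sum(1 for _, _, passed in rows if passed) / len(rows)

# Share of all output tokens consumed by the single failing task.
fail_out = sum(tokens_out for _, tokens_out, passed in rows if not passed)
fail_share = fail_out / total_out

print(total_in, total_out, mean_reward)  # 80679 36833 0.75
print(round(fail_share, 2))              # 0.87
```

The failing task alone accounts for roughly 87% of the job's output tokens, and its 32000 output tokens sit just under the 32768 maxTokens limit.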