Identical repeat of the first smoke run (same 4 tasks, qwen3.6-35b-a3b, thinking uncontrolled) to check whether 4/4 was stable.
3/4. regex-log failed: the model looped inside <think> and spent
~32k output tokens, the entire maxTokens budget, on reasoning in a
single turn. That turn ended with stopReason: length in the transcript,
before the model emitted any action, so the task failed.
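
The failure signature described above (a turn cut off by the token limit while still reasoning) could be checked mechanically. The following is a minimal sketch, not the harness's own code: the `Turn` class and its field names are hypothetical stand-ins for whatever the transcript format actually records, and it assumes the raw output still contains the literal `<think>` tag (some chat templates strip it).

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One assistant turn. Field names are hypothetical, not the harness schema."""
    stop_reason: str    # e.g. "stop" or "length"
    output_tokens: int  # tokens generated in this turn
    text: str           # raw assistant output, including any <think> block


def is_runaway_thinking(turn: Turn, max_tokens: int, slack: float = 0.95) -> bool:
    """True when a turn hit the token limit with its <think> block still open."""
    # The turn must have stopped on length and used (nearly) the whole budget.
    hit_limit = (
        turn.stop_reason == "length"
        and turn.output_tokens >= slack * max_tokens
    )
    # An opened but never closed <think> means no action was ever emitted.
    unclosed_think = "<think>" in turn.text and "</think>" not in turn.text
    return hit_limit and unclosed_think
```

Applied to the failing regex-log turn (32,000 output tokens against a maxTokens of 32768), a check like this would flag it, while a turn that closes its reasoning and stops normally would not.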
This is the defining failure mode of these Qwen reasoning models when thinking is unbounded: the reasoning trace doesn't self-terminate, and one turn can eat the whole output budget. The previous run's 4/4 was luck. Reasoning needs to be bounded, not just enabled/disabled.
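
One way to picture "bounded" reasoning is a client-side guard that force-closes the think block once a budget is spent. The sketch below is illustrative only and is not the mechanism used in these runs: `cap_thinking` is a hypothetical helper, it assumes the stream arrives as string pieces with `<think>` and `</think>` as whole pieces, and it counts each piece as one token, which is an approximation.

```python
from typing import Iterable, Iterator

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def cap_thinking(pieces: Iterable[str], budget: int) -> Iterator[str]:
    """Pass streamed pieces through, force-closing <think> after `budget` pieces."""
    inside = False  # currently between <think> and </think>?
    used = 0        # reasoning pieces seen so far
    for piece in pieces:
        if piece == THINK_OPEN:
            inside = True
        elif piece == THINK_CLOSE:
            inside = False
        elif inside:
            used += 1
            if used > budget:
                # Budget exhausted: close the trace and stop consuming the stream.
                # The caller would then re-prompt with the closed trace so the
                # model produces an action instead of more reasoning.
                yield THINK_CLOSE
                return
        yield piece
```

With a budget well below maxTokens, a looping trace is truncated early and the remaining output budget is left for the action, which is the property the uncapped run lacked.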
A/B experiment: same tasks with thinking off vs thinking on (still
uncapped) to see what thinking actually buys — the cmpthink-* runs.
| setting | value |
|---|---|
| model | llama-local/qwen3.6-35b-a3b |
| agent | harnesses.minimal_pi:MinimalPi |
| thinking | (none) |
| reasoning budget | direct (server/none)\* |
| maxTokens / contextWindow | 32768 / 131072 |
| agent timeout | ×2.0 |
| trials | 4 of 4 (3 pass · 1 fail) |
| mean reward | 0.75 |
| tokens (job total) | 109,209 in / 37,503 out |
| started / finished | 2026-07-02T16:48 / 2026-07-02T16:55 |
| wall clock | 7m06s |

\* See journal for the authoritative mechanism.

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 58s | 10s | 40989/1413 |

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 1m05s | 13s | 34654/1877 |

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 1m02s | 13s | 31809/2213 |

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | FAIL | 3m59s | 3m00s | 1757/32000 |
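
As a cross-check, the job-level summary can be recomputed from the per-trial rows above. This is a small sketch using only the numbers in the tables; it assumes reward is 1 for PASS and 0 for FAIL, which is consistent with the reported mean of 0.75 but is not stated in the log.

```python
# (result, input tokens, output tokens), copied from the per-trial tables
trials = [
    ("PASS", 40989, 1413),
    ("PASS", 34654, 1877),
    ("PASS", 31809, 2213),
    ("FAIL", 1757, 32000),
]

# Assumption: reward is 1 for PASS and 0 for FAIL.
passes = sum(1 for result, _, _ in trials if result == "PASS")
mean_reward = passes / len(trials)

total_in = sum(tokens_in for _, tokens_in, _ in trials)
total_out = sum(tokens_out for _, _, tokens_out in trials)

# Share of the job's output tokens consumed by the single failing trial.
fail_out = sum(tokens_out for result, _, tokens_out in trials if result == "FAIL")
fail_share = fail_out / total_out

print(f"mean reward: {mean_reward}")
print(f"tokens: {total_in:,} in / {total_out:,} out")
print(f"failing trial share of output: {fail_share:.0%}")
```

The totals match the summary (109,209 in / 37,503 out), and the failing trial alone accounts for roughly 85% of all output tokens in the job.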