
smoke__qwen3.6-35b-a3b__20260702-164851

Re-run exposes the runaway-reasoning loop (3/4) — the first real failure mode

What was run

Identical repeat of the first smoke run (same 4 tasks, qwen3.6-35b-a3b, thinking uncontrolled) to check whether 4/4 was stable.

Results

3/4 passed. regex-log failed: the model looped inside <think> and spent 32,000 output tokens on reasoning in a single turn, essentially the entire maxTokens budget of 32768. The turn therefore produced no usable action, and the task failed. The transcript shows stopReason: length on that turn.
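The failure signature described above (a turn that stops on length while still inside an unclosed <think> block) can be checked mechanically. The following is a minimal sketch, assuming each transcript turn is available as a dict with a `stopReason` field (as seen in the transcript) and a `text` field holding the raw model output; the `text` field name and the `find_runaway_turns` helper are hypothetical and not part of the harness.

```python
from typing import Dict, List


def find_runaway_turns(turns: List[Dict[str, str]]) -> List[int]:
    """Return indices of turns that hit the length limit inside an open <think> block.

    Assumes (hypothetically) that each turn is a dict with:
      - "stopReason": the stop reason reported for the turn
      - "text": the raw model output, including any <think> tags
    """
    flagged = []
    for index, turn in enumerate(turns):
        text = turn.get("text", "")
        hit_limit = turn.get("stopReason") == "length"
        # More opening than closing tags means the reasoning never terminated.
        think_unclosed = text.count("<think>") > text.count("</think>")
        if hit_limit and think_unclosed:
            flagged.append(index)
    return flagged


# Illustrative data only, not taken from the actual transcript.
example_turns = [
    {"stopReason": "stop", "text": "<think>plan</think>run the command"},
    {"stopReason": "length", "text": "<think>wait, let me reconsider... wait,"},
]
runaway = find_runaway_turns(example_turns)
```

Requiring both conditions avoids flagging turns that merely produced a long but complete answer.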

Interpretation

This is the defining failure mode of these Qwen reasoning models when thinking is unbounded: the reasoning trace doesn't self-terminate, and one turn can eat the whole output budget. The previous run's 4/4 was luck. Reasoning needs to be bounded, not just enabled/disabled.
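One way to picture "bounded, not just enabled/disabled" is a client-side cap that stops consuming a streamed response once the open <think> block exceeds a budget. This is only an illustrative sketch, not the mechanism used by the harness or the server: the `consume_with_reasoning_cap` helper is hypothetical, and whitespace-separated words stand in for real tokens.

```python
from typing import Iterable, Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def consume_with_reasoning_cap(
    chunks: Iterable[str], max_reasoning_words: int
) -> Tuple[str, bool]:
    """Accumulate streamed text and stop early if open reasoning exceeds the cap.

    Returns (text_so_far, capped). Words are a crude proxy for tokens
    (an assumption made only to keep the sketch self-contained).
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        start = buffer.rfind(OPEN_TAG)
        if start == -1 or buffer.find(CLOSE_TAG, start) != -1:
            continue  # not currently inside an unterminated <think> block
        reasoning = buffer[start + len(OPEN_TAG):]
        if len(reasoning.split()) > max_reasoning_words:
            return buffer, True  # budget exceeded: abandon this turn's reasoning
    return buffer, False


# Illustrative stream: the reasoning never closes and passes the 5-word cap.
text, capped = consume_with_reasoning_cap(
    ["<think>", "a b c ", "d e f ", "g h"], max_reasoning_words=5
)
```

With a cap like this, a looping turn costs at most the reasoning budget rather than the whole maxTokens allowance, which leaves room for a retry or a forced answer.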

Next steps

A/B experiment: same tasks with thinking off vs thinking on (still uncapped) to see what thinking actually buys — the cmpthink-* runs.

Run details

model: llama-local/qwen3.6-35b-a3b
agent: harnesses.minimal_pi:MinimalPi
thinking: (none)
reasoning budget: direct (server/none)* (see journal for the authoritative mechanism)
maxTokens / contextWindow: 32768 / 131072
agent timeout: ×2.0
trials: 4 of 4 (3 pass · 1 fail)
mean reward: 0.75
tokens (job total): 109,209 in / 37,503 out
started / finished: 2026-07-02T16:48 / 2026-07-02T16:55
wall clock: 7m06s
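The job-level figures can be cross-checked against the per-task rows listed under Tasks. The sketch below assumes reward is 1 for a pass and 0 for a fail, which is consistent with the reported 0.75 but is not stated explicitly in this report; the variable names are illustrative.

```python
# Per-task rows from the Tasks section: (task, passed, input tokens, output tokens).
rows = [
    ("fix-git", True, 40989, 1413),
    ("nginx-request-logging", True, 34654, 1877),
    ("openssl-selfsigned-cert", True, 31809, 2213),
    ("regex-log", False, 1757, 32000),
]

# Assumption: reward is 1.0 for a pass and 0.0 for a fail.
mean_reward = sum(1.0 if passed else 0.0 for _, passed, _, _ in rows) / len(rows)
total_in = sum(tokens_in for _, _, tokens_in, _ in rows)
total_out = sum(tokens_out for _, _, _, tokens_out in rows)

# Share of all output tokens consumed by the single failed task.
regex_log_out = next(out for name, _, _, out in rows if name == "regex-log")
regex_log_share = regex_log_out / total_out

print(mean_reward, total_in, total_out)  # 0.75 109209 37503
```

The totals match the reported 109,209 in / 37,503 out, and the failed regex-log trial alone accounts for roughly 85% of the job's output tokens.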

Tasks

fix-git — 1/1 passed

#  result  total  agent  in/out tok
1  PASS    58s    10s    40989/1413

nginx-request-logging — 1/1 passed

#  result  total  agent  in/out tok
1  PASS    1m05s  13s    34654/1877

openssl-selfsigned-cert — 1/1 passed

#  result  total  agent  in/out tok
1  PASS    1m02s  13s    31809/2213

regex-log — 0/1 passed

#  result  total  agent  in/out tok
1  FAIL    3m59s  3m00s  1757/32000