cmpthink-on__qwen3.6-35b-a3b__20260702-172246

A/B part 2 — thinking ON (uncapped): the mirror image (3/4)

What was run

Other half of the thinking comparison: same 4 smoke tasks, qwen3.6-35b-a3b, --thinking on, still no reasoning cap of any kind.

Results

3/4, and exactly the mirror of cmpthink-off: openssl-selfsigned-cert passes (thinking helps it), while regex-log loops in reasoning, emits 32000 output tokens against 1757 input, and fails again.

Interpretation

The A/B is conclusive: thinking has real per-task value, and the loop is reproducible — it's a property of uncapped reasoning on this model, not noise. The requirement is now precise: keep thinking on, cap the per-turn reasoning length, force an answer at the cap.

Next steps

Implement a reasoning cap. First mechanism: an external host-side proxy (port 8021) that splices in the end-of-thinking tag at ~4000 tokens — validated in the next run.
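
A minimal sketch of the splice logic such a proxy might apply, not the proxy itself. The tag strings `<think>` and `</think>`, the function name `cap_reasoning`, and the simplification that each tag arrives as its own stream token are assumptions for illustration; the real tag comes from the model's chat template, and the HTTP handling on port 8021 is omitted.

```python
from typing import Iterable, Tuple

# Assumed tag names; check the model's chat template before relying on them.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def cap_reasoning(tokens: Iterable[str], cap: int = 4000) -> Tuple[str, bool]:
    """Consume a streamed completion and enforce a reasoning-token cap.

    Returns (text, capped). If the model closes its reasoning by itself
    before `cap` reasoning tokens, the text is returned unchanged and
    capped is False. Otherwise the stream is abandoned at the cap,
    THINK_CLOSE is appended, and capped is True; the caller would then
    re-prompt with this text as a prefix so the model must answer.
    """
    out = []
    reasoning_tokens = 0
    in_think = False
    for tok in tokens:
        out.append(tok)
        if tok == THINK_OPEN:
            in_think = True
        elif tok == THINK_CLOSE:
            in_think = False
        elif in_think:
            reasoning_tokens += 1
            if reasoning_tokens >= cap:
                # Splice in the end-of-thinking tag and stop reading.
                out.append(THINK_CLOSE)
                return "".join(out), True
    return "".join(out), False


# A looping trace is cut at the cap; a short trace passes through untouched.
looping = [THINK_OPEN] + ["wait "] * 10000
capped_text, was_capped = cap_reasoning(looping, cap=4000)
short_text, short_capped = cap_reasoning([THINK_OPEN, "ok", THINK_CLOSE, "answer"])
```

Counting stream tokens only approximates the tokenizer's count, which is consistent with the "~4000" in the plan above.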

Run details

model: llama-local/qwen3.6-35b-a3b
agent: harnesses.minimal_pi:MinimalPi
thinking: on
reasoning budget: direct (server/none)*; see journal for the authoritative mechanism
maxTokens / contextWindow: 32768 / 131072
agent timeout: ×2.0
trials: 4 of 4 (3 pass · 1 fail)
mean reward: 0.75
tokens (job total): 80,679 in / 36,833 out
started / finished: 2026-07-02T17:22 / 2026-07-02T17:29
wall clock: 7m05s

Tasks

fix-git — 1/1 passed

#  result  total  agent  in/out tok
1  PASS    57s    7s     17008/1033

nginx-request-logging — 1/1 passed

#  result  total  agent  in/out tok
1  PASS    1m02s  13s    35684/1887

openssl-selfsigned-cert — 1/1 passed

#  result  total  agent  in/out tok
1  PASS    1m01s  12s    26230/1913

regex-log — 0/1 passed

#  result  total  agent  in/out tok
1  FAIL    4m04s  3m05s  1757/32000
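
As a cross-check, the per-task rows can be summed against the job totals in the run details. This is a small sketch using only the numbers reported above; it assumes reward is 1 for a pass and 0 for a fail, which matches the reported mean of 0.75 but is not stated explicitly in the run output.

```python
# (task, passed, input_tokens, output_tokens), copied from the task tables
trials = [
    ("fix-git", True, 17008, 1033),
    ("nginx-request-logging", True, 35684, 1887),
    ("openssl-selfsigned-cert", True, 26230, 1913),
    ("regex-log", False, 1757, 32000),
]

total_in = sum(t[2] for t in trials)
total_out = sum(t[3] for t in trials)

# Assumed reward scheme: pass = 1, fail = 0.
mean_reward = sum(1 for t in trials if t[1]) / len(trials)

# Fraction of all output tokens produced by the one failing task.
regex_out_share = trials[-1][3] / total_out

print(f"tokens: {total_in} in / {total_out} out")
print(f"mean reward: {mean_reward}")
print(f"regex-log share of output tokens: {regex_out_share:.1%}")
```

The sums match the reported job totals, and the single failing trial accounts for the large majority of output tokens in the job.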