← older: smoke__qwen3.6-35b-a3b__20260702-164851all runsnewer: cmpthink-on__qwen3.6-35b-a3b__20260702-172246

cmpthink-off__qwen3.6-35b-a3b__20260702-171738

A/B part 1 — thinking OFF: no loops, but the harder task regresses (3/4)

What was run

Half of a thinking on/off comparison: same 4 smoke tasks, qwen3.6-35b-a3b, with --thinking off (the harness now sends chat_template_kwargs.enable_thinking=false via pi's qwen-chat-template compat — this run validated that control path, which had an extension bug earlier).

Results

3/4. regex-log — the task that loops with thinking on — passes with thinking off. But openssl-selfsigned-cert fails.

Interpretation

Together with the cmpthink-on sibling run this shows a clean trade-off: thinking OFF prevents the runaway loop but costs quality on tasks that benefit from deliberation; thinking ON helps those tasks but risks the loop. Neither binary setting is right — we want thinking ON but length-bounded.

Next steps

Run the cmpthink-on sibling to complete the A/B, then build a reasoning cap.

Run details

modelllama-local/qwen3.6-35b-a3bagentharnesses.minimal_pi:MinimalPithinkingoffreasoning budgetdirect (server/none)* — see journal for the authoritative mechanismmaxTokens / contextWindow32768 / 131072agent timeout ×2.0trials4 of 4 — 3 pass · 1 failmean reward0.75tokens (job total)271,546 in / 14,916 outstarted / finished2026-07-02T17:17 / 2026-07-02T17:22wall clock5m06s

Tasks

fix-git — 1/1 passed

#resulttotalagentin/out tok
1PASS59s7s33741/807

nginx-request-logging — 1/1 passed

#resulttotalagentin/out tok
1PASS1m01s10s46907/1254

openssl-selfsigned-cert — 0/1 passed

#resulttotalagentin/out tok
1FAIL1m00s10s29155/1239

regex-log — 1/1 passed

#resulttotalagentin/out tok
1PASS2m05s1m09s161743/11616