smoke__qwen3.6-35b-a3b__20260702-163559

First smoke test — the pipeline works end to end (4/4, but luck was involved)

What was run

First-ever smoke run of the custom minimal-pi harness against the local llama-swap server: qwen3.6-35b-a3b (MoE, 3B active), 4 quick tasks × 1 attempt. Thinking was uncontrolled: this run predates any reasoning-budget mechanism and even the thinking kwarg (the config shows thinking=(none)).

Results

4/4 passed. Harness plumbing (in-container pi → Docker gateway → llama-swap on port 8020, custom provider extension with raised maxTokens) all worked.
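The gateway hop in that chain can be probed on its own before a run. The following is a minimal sketch, not the harness's actual check: it assumes llama-swap answers an OpenAI-style `/v1/models` request on port 8020, and the host name passed in (for example `host.docker.internal`) is a hypothetical stand-in for whatever address the container uses to reach the Docker gateway.

```python
import json
import urllib.request


def models_url(host: str, port: int = 8020) -> str:
    """Build the model-listing URL (assumed OpenAI-compatible path)."""
    return f"http://{host}:{port}/v1/models"


def list_models(host: str, port: int = 8020, timeout: float = 5.0) -> list[str]:
    """Return the model ids the server reports; raises on connection failure."""
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumption: the response has the usual {"data": [{"id": ...}, ...]} shape.
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `list_models("host.docker.internal")` from inside the container would be expected to raise if the gateway or port mapping is wrong, which separates plumbing failures from model failures.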

Interpretation

A clean sheet, but misleading: regex-log happened to terminate its reasoning this time, and it still took 4m02s of agent time and 39,756 of the job's 46,042 output tokens. The very next run showed that with uncontrolled thinking the model can loop in <think> and burn the whole output budget, so this pass was partly luck, not a solved problem.
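The failure mode described here can be flagged mechanically. This is a minimal sketch under stated assumptions, not a mechanism the harness has: it assumes the raw completion text is available with literal `<think>` and `</think>` tags, and it takes the 32768 maxTokens value from the run details as the default budget. The function names are hypothetical.

```python
def think_is_unterminated(text: str) -> bool:
    """True if the last <think> tag is never closed by a later </think>."""
    # rfind returns -1 when a tag is absent, so text without tags is False.
    return text.rfind("<think>") > text.rfind("</think>")


def looks_like_runaway(text: str, out_tokens: int, max_tokens: int = 32768) -> bool:
    """True if the output budget was exhausted while still inside <think>."""
    return think_is_unterminated(text) and out_tokens >= max_tokens
```

A response that ends mid-reasoning at the token cap is reported as a runaway, while one that closes its reasoning and then answers is not, regardless of length.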

Next steps

Re-run the same smoke set to check stability of the result.

Run details

model: llama-local/qwen3.6-35b-a3b
agent: harnesses.minimal_pi:MinimalPi
thinking: (none)
reasoning budget: direct (server/none)*; see journal for the authoritative mechanism
maxTokens / contextWindow: 32768 / 131072
agent timeout: ×2.0
trials: 4 of 4 (4 pass, 0 fail)
mean reward: 1.00
tokens (job total): 516,270 in / 46,042 out
started / finished: 2026-07-02T16:36 / 2026-07-02T16:44
wall clock: 8m49s

Tasks

fix-git — 1/1 passed

# | result | total | agent | in/out tok
1 | PASS | 1m27s | 23s | 26570/1103

nginx-request-logging — 1/1 passed

# | result | total | agent | in/out tok
1 | PASS | 1m09s | 14s | 40905/2044

openssl-selfsigned-cert — 1/1 passed

# | result | total | agent | in/out tok
1 | PASS | 1m09s | 19s | 36334/3139

regex-log — 1/1 passed

# | result | total | agent | in/out tok
1 | PASS | 5m02s | 4m02s | 412461/39756
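As a cross-check on the numbers, the per-task token counts can be summed against the job total and used to show how much of the run regex-log consumed. This is a small arithmetic sketch using only the figures reported in the task tables; the helper names are hypothetical.

```python
# Per-task (input, output) token counts copied from the task tables.
TASK_TOKENS = {
    "fix-git": (26570, 1103),
    "nginx-request-logging": (40905, 2044),
    "openssl-selfsigned-cert": (36334, 3139),
    "regex-log": (412461, 39756),
}


def totals(tokens: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum input and output tokens across all tasks."""
    total_in = sum(tok_in for tok_in, _ in tokens.values())
    total_out = sum(tok_out for _, tok_out in tokens.values())
    return total_in, total_out


def share(tokens: dict[str, tuple[int, int]], task: str) -> tuple[float, float]:
    """Fraction of the job's input and output tokens used by one task."""
    total_in, total_out = totals(tokens)
    task_in, task_out = tokens[task]
    return task_in / total_in, task_out / total_out


total_in, total_out = totals(TASK_TOKENS)
in_share, out_share = share(TASK_TOKENS, "regex-log")
print(f"job total: {total_in:,} in / {total_out:,} out")
# prints: job total: 516,270 in / 46,042 out
print(f"regex-log share: {in_share:.1%} in / {out_share:.1%} out")
# prints: regex-log share: 79.9% in / 86.3% out
```

The sums match the job totals in the run details, and a single task accounts for roughly four fifths of the input and more than four fifths of the output tokens.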