Ran the dense sibling model qwen3.6-27b (not the MoE) on the 4-task smoke set, 1 attempt each, through the reasoning-cap proxy (port 8021, cap ~4000), thinking ON.
Result: 4/4 pass.
The harness + proxy mechanism works unchanged on both Qwen variants, and the dense 27B handles the smoke set fine. The dense model decodes more slowly than the MoE (~3B active parameters), however, so the MoE remains the primary benchmark target for full-suite runs on this single 3090.
Back to the MoE for a bigger-K statistical run of the two signal tasks.
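A rough illustration of why one attempt per task is only a smoke signal and a bigger K is worth the time: the sketch below computes a Wilson score interval for a pass rate. The function name `wilson_interval` and the choice of a 95% interval (z = 1.96) are assumptions made for illustration; they are not part of the harness.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical all-pass outcomes at different K, to show how the bound tightens.
for k in (1, 4, 10, 30):
    low, high = wilson_interval(k, k)
    print(f"K={k:>2}  all pass  ->  pass rate in [{low:.2f}, {high:.2f}]")
```

With a single passing attempt the lower bound sits near 0.21, and even ten straight passes only lift it to about 0.72, so a 1-attempt PASS says little about a task's true pass rate.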
| field | value |
|---|---|
| model | llama-local/qwen3.6-27b |
| agent | harnesses.minimal_pi:MinimalPi |
| thinking | on |
| reasoning budget | proxy · cap 4000 (see journal for the authoritative mechanism) |
| maxTokens / contextWindow | 32768 / 131072 |
| agent timeout | ×2.0 |
| trials | 4 of 4: 4 pass · 0 fail |
| mean reward | 1.00 |
| tokens (job total) | 399,991 in / 22,645 out |
| started / finished | 2026-07-02T19:03 / 2026-07-02T19:16 |
| wall clock | 13m06s |

| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 1m23s | 35s | 31489/1589 |
| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 1m38s | 47s | 44147/2390 |
| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 1m31s | 41s | 30601/2241 |
| # | result | total | agent | in/out tok |
|---|---|---|---|---|
| 1 | PASS | 8m33s | 7m31s | 293754/16425 |
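
As a quick sanity check, the per-task token counts can be summed and compared with the job total reported in the run summary. This is a minimal sketch with the numbers copied by hand from the tables above; the variable names are illustrative and nothing here reads the harness output directly.

```python
# (input, output) token counts, one pair per task, in table order.
rows = [
    (31489, 1589),
    (44147, 2390),
    (30601, 2241),
    (293754, 16425),
]

total_in = sum(tokens_in for tokens_in, _ in rows)
total_out = sum(tokens_out for _, tokens_out in rows)

# Job totals as reported in the run summary: 399,991 in / 22,645 out.
assert (total_in, total_out) == (399_991, 22_645)

# Share of input tokens consumed by the last (longest) task.
last_task_share = rows[-1][0] / total_in
print(f"in={total_in} out={total_out} last-task input share={last_task_share:.1%}")
```

The totals match, and the last task alone accounts for roughly three quarters of the input tokens, in line with its 8m33s share of the 13m06s wall clock.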