Proxy cap at K=15, the evidence run: regex-log 93%, openssl 60% is the model baseline (mean 0.77)

What was run

The statistical run: the two signal tasks (regex-log and openssl-selfsigned-cert) × 15 attempts each (30 trials), routed through the reasoning-cap proxy (port 8021, cap ~4000) with qwen3.6-35b-a3b, thinking ON.

Results

Mean reward 0.77 (23/30 trials passed): regex-log 14/15 (93%), openssl-selfsigned-cert 9/15 (60%).
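The headline numbers follow directly from the pass counts. A minimal sketch of the arithmetic, assuming each trial is scored 1 for PASS and 0 for FAIL (the variable names are illustrative, not taken from the harness):

```python
# Pass counts per task as (passed, attempts), copied from the Results line.
results = {
    "regex-log": (14, 15),
    "openssl-selfsigned-cert": (9, 15),
}

# Per-task pass rate.
rates = {task: passed / attempts for task, (passed, attempts) in results.items()}

# Mean reward over all trials, assuming reward is 1 for PASS and 0 for FAIL.
total_passed = sum(passed for passed, _ in results.values())
total_trials = sum(attempts for _, attempts in results.values())
mean_reward = total_passed / total_trials

for task, rate in rates.items():
    print(f"{task}: {rate:.0%}")
print(f"mean reward: {mean_reward:.2f} ({total_passed}/{total_trials})")
```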

Interpretation

The budget works. regex-log, which reliably loops and fails uncapped, passes ~95% with the cap (19/20: 14/15 in this run plus 5/5 in the K=5 run). This is the evidence that motivated making the cap permanent.
openssl stabilizes at ~60%. Its short trajectories never reach the cap, so this is the model's capability baseline on that task, not a harness issue.
4000 was still an untuned first guess; a proper 2000/4000/8000 sweep on the full suite remains to be done.
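The claim that openssl's trajectories stay below the cap can be spot-checked against the per-trial output-token counts in the run tables. The sketch below assumes the cap is counted in tokens and that a trial's total output tokens bound the reasoning tokens of any single response. The counts are copied from the 15-row table whose 9/15 pass count matches openssl-selfsigned-cert; `CAP` and the variable names are illustrative.

```python
# Assumed cap value, taken from the run description ("cap ~4000").
CAP = 4000

# Output tokens per openssl-selfsigned-cert trial, copied from the run table.
openssl_out_tokens = [
    3493, 2081, 3478, 3405, 3215, 2627, 2432, 2151,
    2692, 2367, 1725, 2799, 3205, 3676, 1727,
]

# A trial's total output is an upper bound on any single response's reasoning,
# so a total below the cap means the cap could not have triggered in that trial.
max_out = max(openssl_out_tokens)
trials_at_or_over_cap = [t for t in openssl_out_tokens if t >= CAP]

print(f"max output tokens: {max_out}, trials at or over cap: {len(trials_at_or_over_cap)}")
```

Under these assumptions no openssl trial comes within reach of the cap, which is consistent with reading the 60% as a capability baseline.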

Next steps

The proxy is an extra moving part. The llama.cpp server has a native --reasoning-budget N flag that does the same job. Next: switch the llama-swap config to it, verify parity against this run, then retire the proxy.
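One way to make "verify parity" concrete is to compare per-task pass rates between this proxy run and a future native-flag run. The following is only a sketch: `parity_report` and its `tolerance` threshold are hypothetical, and no native-run numbers exist yet, so the example only checks the proxy run against itself.

```python
def parity_report(proxy, native, tolerance=0.15):
    """Return {task: (rate difference, within tolerance?)} for two runs.

    Each argument maps a task name to (passed, attempts). The tolerance is a
    hypothetical threshold, not a value from the journal.
    """
    report = {}
    for task, (p_pass, p_n) in proxy.items():
        n_pass, n_n = native[task]
        diff = n_pass / n_n - p_pass / p_n
        report[task] = (round(diff, 3), abs(diff) <= tolerance)
    return report


# Pass counts from this proxy run; the native run's counts do not exist yet.
proxy_run = {"regex-log": (14, 15), "openssl-selfsigned-cert": (9, 15)}

# Sanity check: a run compared against itself must report exact parity.
self_check = parity_report(proxy_run, proxy_run)
print(self_check)
```

With K=15 a single trial moves a task's pass rate by about 6.7 points, so any tolerance tighter than that would flag ordinary run-to-run noise.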

Run details

model: llama-local/qwen3.6-35b-a3b
agent: harnesses.minimal_pi:MinimalPi
thinking: on
reasoning budget: proxy · cap 4000 (see journal for the authoritative mechanism)
maxTokens / contextWindow: 32768 / 131072
agent timeout: ×2.0
trials: 30 of 30 (23 pass · 7 fail)
mean reward: 0.77
tokens (job total): 8,439,802 in / 337,955 out
started / finished: 2026-07-02T19:17 / 2026-07-02T20:40
wall clock: 82m35s

Tasks

openssl-selfsigned-cert: 9/15 passed

#	result	total time	agent time	tokens in/out
1	FAIL	1m26s	24s	103666/3493
2	FAIL	1m04s	15s	41642/2081
3	PASS	1m12s	21s	61843/3478
4	FAIL	1m16s	25s	59036/3405
5	PASS	1m09s	20s	45721/3215
6	PASS	1m06s	16s	31541/2627
7	PASS	1m05s	15s	41577/2432
8	PASS	1m04s	13s	26449/2151
9	FAIL	1m10s	18s	51563/2692
10	PASS	1m03s	14s	34558/2367
11	PASS	1m00s	10s	18628/1725
12	FAIL	1m10s	19s	52696/2799
13	PASS	1m10s	20s	46844/3205
14	PASS	1m13s	23s	77012/3676
15	FAIL	1m03s	12s	25687/1727

regex-log: 14/15 passed

#	result	total time	agent time	tokens in/out
1	PASS	3m44s	2m39s	245551/13559
2	PASS	3m29s	2m31s	434569/18856
3	PASS	6m29s	5m32s	1128803/36436
4	PASS	3m57s	2m58s	282053/16089
5	FAIL	5m52s	4m55s	926857/33742
6	PASS	3m06s	2m08s	128080/9398
7	PASS	2m53s	1m53s	185126/11999
8	PASS	5m05s	4m08s	456024/18137
9	PASS	2m56s	1m57s	168521/5712
10	PASS	3m50s	2m53s	271701/16024
11	PASS	3m39s	2m38s	527233/19523
12	PASS	4m26s	3m26s	397215/20459
13	PASS	3m31s	2m31s	246277/18877
14	PASS	3m55s	2m53s	286243/17394
15	PASS	8m19s	7m21s	2037086/40677

Run ID: smoke__qwen3.6-35b-a3b__20260702-191746
