First run with the H6 context fix from the full-suite analysis
(ANALYSIS-qwen3.6-35b-a3b-suite-full.md §2/§6):

- context_window 131072 → 229376 (a safe ~32k below the server's real -c 262144);
- COMPACT_AT_TOKENS=200000 overflow guard: the generated pi extension registers a turn_end hook that forces compaction once live context crosses 200k, so a single large turn can never overshoot into the fatal `400 … exceeds the available context size`.

Task pick: regex-log (baseline-pass control) + video-processing and
path-tracing-reverse, two of the four trials the 20260703 suite lost to the
hard 262k 400. qwen3.6-35b-a3b, thinking on, server --reasoning-budget 4000,
max_tokens=32768, 3 tasks × 1, ~24 min wall.

| task | suite 20260703 | this run |
|---|---|---|
| regex-log | PASS 3m20s, peak ctx 39k | PASS 4m37s, peak ctx 51k |
| path-tracing-reverse | FAIL — hard-400 death at 261k ctx, 29m | FAIL — clean stop at 165k peak, 8m01s, 2/3 subtests (image.c exists + compiles; similarity 0.55 vs ≥0.995) |
| video-processing | FAIL — hard-400 death at 260k ctx, 19m, + a length-stop runaway | FAIL — clean stop at 113k peak, 8m20s, 3/5 subtests (analyzer exists/imports; both frame-detection assertions fail) |
Reproduce: make failures RUN=runs/smoke__qwen3.6-35b-a3b__20260704-121319.
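
The margins behind the two H6 settings can be checked with a few lines of arithmetic. The following is a sketch only: the constant and function names are illustrative (not harness config keys), the numbers are the ones quoted in this report, and the strict `>` comparison is an assumption about how "crosses 200k" is evaluated.

```python
# Values quoted in this report; names are illustrative, not harness config keys.
SERVER_CTX = 262_144      # server's real -c
CONTEXT_WINDOW = 229_376  # declared context_window after the H6 fix
COMPACT_AT = 200_000      # COMPACT_AT_TOKENS guard threshold


def headroom(live_tokens: int) -> dict:
    """Distance in tokens from the current live context to each limit."""
    return {
        "to_guard": COMPACT_AT - live_tokens,
        "to_declared_window": CONTEXT_WINDOW - live_tokens,
        "to_server_limit": SERVER_CTX - live_tokens,
    }


def guard_should_fire(live_tokens: int) -> bool:
    """Assumed decision rule: compact once live context exceeds the threshold."""
    return live_tokens > COMPACT_AT


# Gap between the declared window and the server's real limit.
print(SERVER_CTX - CONTEXT_WINDOW)
# The largest peak in this run (165k) against each limit.
print(headroom(165_000))
print(guard_should_fire(165_000))
```

With the 165k peak from this run, the guard threshold is still 35k tokens away, which is consistent with the guard never firing.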
The guard is installed and harmless, but UNVALIDATED.
`minimal-pi: context guard active (compact_at=200000)` appears in every
trial, there are zero hard 400s, and the regex-log control still passes with
timing in family. But no trial crossed 200k, so the guard never actually
fired: zero forced compactions, zero compaction events of any kind. This
run proves the fix doesn't break anything; it does not prove it recovers
a 400-bound trial.
The two 400-victims improved by trajectory variance, not by the fix. Both ended with a voluntary clean stop at 113–165k — well below where either the guard or the declared window matters — in 8 min instead of 19–29 min, and video-processing didn't repeat its length-stop runaway. Same tasks, same model, same sampler: they simply drew shorter, luckier trajectories this time. This is the K=1 variance the full analysis flagged (H5), landing in the favorable direction. Do not book these as wins for H6.
Both fails reproduce the "near-miss on a hidden criterion" pattern (§4).
Both self-declared done; both passed some of the hidden subtests (2/3 and
3/5). path-tracing-reverse is literally one assertion away: its reconstructed
C program exists and compiles, but renders the wrong image (cosine similarity
0.5547 vs the required ≥0.995). video-processing confidently wrote
jump_takeoff_frame_number = 48 / jump_land_frame_number = 72 to
output.toml; both functional assertions disagree. These are capability
misses, not harness misses.
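
For readers unfamiliar with the metric, here is a minimal sketch of a cosine-similarity check between two images flattened to pixel vectors. The hidden test's actual implementation is not shown in this report, so the function below and the toy vectors are assumptions; only the ≥0.995 threshold comes from the text above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length pixel vectors."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of values")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-black images
    return dot / (norm_a * norm_b)


THRESHOLD = 0.995  # required similarity quoted in this report

# Toy data, not taken from the run.
reference = [0.0, 0.5, 1.0, 0.5]
identical = list(reference)
mirrored = [1.0, 0.5, 0.0, 0.5]

print(cosine_similarity(reference, identical) >= THRESHOLD)
print(cosine_similarity(reference, mirrored) >= THRESHOLD)
```

A threshold this close to 1 leaves little room for an image that is merely plausible: the render has to match the reference almost pixel for pixel.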
NEW texture: degenerate final turn on path-tracing-reverse. Its last
assistant message cuts off mid-`<think>` (mid-sentence, analyzing xmm0 bit
layout), and the entire visible text is the literal token `user` (a
chat-template role-marker leak), followed by a clean stop. The agent
didn't decide it was done; it fell out of the conversation while holding a
compiling near-miss it never got to iterate on. Plausible mechanism: the
server's reasoning-budget nudge force-closes `</think>` and the model
resumes in the wrong template state. One occurrence = anecdote; if it
recurs, it's a stop-token/template bug worth chasing (grep pi.txt for a
final stop whose only text is `user`).
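
A check for this ending could look like the sketch below. It assumes the final assistant message's visible text and raw text have already been extracted from the transcript as strings; the pi.txt format is not described here, so the parsing step is deliberately left out, and the helper names and the set of role markers are hypothetical.

```python
# Assumed set of chat-template role names; only "user" was observed in this run.
ROLE_MARKERS = {"user", "assistant", "system"}


def looks_like_role_leak(visible_text: str) -> bool:
    """True if the visible text is nothing but a bare role marker."""
    return visible_text.strip().lower() in ROLE_MARKERS


def has_unclosed_think(raw_text: str) -> bool:
    """True if the message opens more <think> blocks than it closes."""
    return raw_text.count("<think>") > raw_text.count("</think>")


def is_degenerate_final_turn(visible_text: str, raw_text: str) -> bool:
    """Flag a final turn that leaks a role marker or stops mid-reasoning."""
    return looks_like_role_leak(visible_text) or has_unclosed_think(raw_text)


# Toy strings, not copied from the transcript.
print(is_degenerate_final_turn("user", "<think>analyzing the bit layout"))
print(is_degenerate_final_turn("Done. image.c compiles.", "<think>ok</think>Done."))
```

Keeping the two signals separate makes it possible to tell a pure role-marker leak from a truncated reasoning block if the pattern recurs.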
Cliff-consistent, for what n=3 is worth: the one trial ≤51k passed, both
>100k failed. No new evidence, no contradiction.
First run with the sampler active (15 s cadence, 96 samples over 23.8 min). Power is an unambiguous idle/busy classifier: every sample is either at the 27 W idle floor (9 samples) or in a ~289 W decode band (87 samples) — nothing in between, exactly the bimodal split the sampler was designed around. (Temp alone would mislead: several "idle" samples still read 61–66 °C from the preceding burst; power is the signal.)
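
The idle/busy split can be reproduced with a single power threshold. The sketch below is illustrative: the 45 W cut-off is the "sub-45 W" figure used elsewhere in this report, the function names are hypothetical, and the sample list is synthetic (9 readings at the idle floor, 87 in the decode band) rather than the real sampler output.

```python
IDLE_MAX_W = 45.0  # cut-off between the ~27 W idle floor and the ~289 W decode band
CADENCE_S = 15     # sampler cadence in seconds


def classify(power_w: float) -> str:
    """Label one power reading as 'idle' or 'busy'."""
    return "idle" if power_w < IDLE_MAX_W else "busy"


def summarize(samples_w):
    """Count idle/busy samples and convert the idle count to time and share."""
    idle = sum(1 for w in samples_w if classify(w) == "idle")
    total = len(samples_w)
    return {
        "idle": idle,
        "busy": total - idle,
        "idle_minutes": idle * CADENCE_S / 60,
        "idle_share": idle / total if total else 0.0,
    }


# Synthetic samples matching the counts reported above.
samples = [27.0] * 9 + [289.0] * 87
print(summarize(samples))
```

Nine idle samples at a 15 s cadence come to 2.25 minutes, or about 9.4% of the 96 samples.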
All 9 idle samples (~2.2 min, 9.4%) fall in per-trial overhead, none in
agent execution. Mapping each sample to its trial phase: 8 land in
environment_setup/agent_setup (the ~31–35 s pi install per trial) and the
~11 s inter-trial teardown gaps; the 1 "idle" sample inside an agent_execution
window is 1 s after the phase started, before the first prefill. During actual
agent execution the GPU never went idle — 0 idle in 84 in-execution samples.
So for LLM-bound tasks the harness adds ~45–50 s of GPU-off overhead per trial
(≈1.2 h if extrapolated to 89 suite trials) and wastes nothing else. This
neither confirms nor refutes the suite's estimated ~7.5 h of GPU-idle container
waits — that idle pool is attributed to CPU-bound in-task work
(train-fasttext, compile-compcert, crack-7z-hash at 1–2 enforced cores), which
this smoke set doesn't exercise, and the 20260703 suite predates the sampler.
The next long run's gpu_samples.csv will answer it directly; a cheap preview is
one CPU-bound smoke task (e.g. crack-7z-hash) and counting sub-45 W samples
inside its agent_execution window.
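
The phase attribution and the extrapolation can be sketched as follows. Everything here is an assumption for illustration: the layout of gpu_samples.csv is not shown in this report, so samples are plain `(timestamp_s, power_w)` tuples, phases are `(name, start_s, end_s)` tuples, and the toy timeline is invented. Only the 45 W cut-off, the 45–50 s overhead, and the 89-trial count come from the text.

```python
def phase_of(t, phases):
    """Return the phase whose half-open window [start, end) contains t."""
    for name, start, end in phases:
        if start <= t < end:
            return name
    return "gap"  # between trials, e.g. teardown


def idle_by_phase(samples, phases, idle_max_w=45.0):
    """Count idle samples per phase."""
    counts = {}
    for t, power_w in samples:
        if power_w < idle_max_w:
            name = phase_of(t, phases)
            counts[name] = counts.get(name, 0) + 1
    return counts


def extrapolated_overhead_hours(per_trial_s, trials):
    """Scale a per-trial GPU-off overhead to a full suite."""
    return per_trial_s * trials / 3600


# Invented timeline for one trial (seconds from job start).
phases = [("agent_setup", 0, 35), ("agent_execution", 35, 515)]
samples = [(0, 27.0), (15, 27.0), (30, 27.0), (45, 289.0), (60, 289.0), (520, 27.0)]
print(idle_by_phase(samples, phases))

# Overhead range quoted above, extrapolated to 89 suite trials.
print(extrapolated_overhead_hours(45, 89), extrapolated_overhead_hours(50, 89))
```

At 45–50 s per trial, 89 trials give roughly 1.1–1.2 h of GPU-off overhead, in line with the ≈1.2 h estimate above.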
1. `make smoke SMOKE_TASKS=regex-log COMPACT_AT_TOKENS=40000`. regex-log peaks ~40–51k, so the hook must fire; confirm the `forcing compaction` log line + a compaction event in pi.txt + the task still passes. Then re-run the two unretested 400-victims (circuit-fibsqrt, make-mips-interpreter; both >257k in the suite) at the default 200k threshold.
2. Watch for the `user` role-leak ending (point 4) in future transcripts.
3. Run one CPU-bound task (e.g. crack-7z-hash) in a smoke and count sub-45 W samples inside its agent_execution window. The 20260703 suite has no gpu_samples.csv, so the ~7.5 h idle estimate is still analytical, not measured.

| field | value |
|---|---|
| model | llama-local/qwen3.6-35b-a3b |
| agent | harnesses.minimal_pi:MinimalPi |
| thinking | on |
| reasoning budget | direct (server/none)*; see journal for the authoritative mechanism |
| maxTokens / contextWindow | 32768 / 229376 |
| agent timeout | ×2.0 |
| trials | 3 of 3 (1 pass · 2 fail) |
| mean reward | 0.33 |
| tokens (job total) | 7,560,039 in / 150,657 out |
| started / finished | 2026-07-04T12:13 / 2026-07-04T12:37 |
| wall clock | 23m48s |

| task | # | result | total | agent | in/out tok |
|---|---|---|---|---|---|
| path-tracing-reverse | 1 | FAIL | 8m57s | 8m01s | 4674900/45850 |
| regex-log | 1 | PASS | 5m33s | 4m37s | 702930/42962 |
| video-processing | 1 | FAIL | 9m17s | 8m20s | 2182209/61845 |