← older: suite__qwen3.6-35b-a3b__20260703-003556all runs

smoke__qwen3.6-35b-a3b__20260704-121319

Context-fix smoke — guard installs cleanly, zero 400s, but it never fired; both former 400-victims now fail as near-misses instead of context deaths

What was run

First run with the H6 context fix from the full-suite analysis (ANALYSIS-qwen3.6-35b-a3b-suite-full.md §2/§6):

Task pick: regex-log (baseline-pass control) + video-processing and path-tracing-reverse — two of the four trials the 20260703 suite lost to the hard 262k 400. qwen3.6-35b-a3b, thinking on, server --reasoning-budget 4000, max_tokens=32768, 3 tasks × 1, ~24 min wall.

Results — 1/3, but read the mechanics, not the score

task suite 20260703 this run
regex-log PASS 3m20s, peak ctx 39k PASS 4m37s, peak ctx 51k
path-tracing-reverse FAIL — hard-400 death at 261k ctx, 29m FAIL — clean stop at 165k peak, 8m01s, 2/3 subtests (image.c exists + compiles; similarity 0.55 vs ≥0.995)
video-processing FAIL — hard-400 death at 260k ctx, 19m, + a length-stop runaway FAIL — clean stop at 113k peak, 8m20s, 3/5 subtests (analyzer exists/imports; both frame-detection assertions fail)

Reproduce: make failures RUN=runs/smoke__qwen3.6-35b-a3b__20260704-121319.

Interpretation

  1. The guard is installed and harmless — but UNVALIDATED. minimal-pi: context guard active (compact_at=200000) appears in every trial, there are zero hard 400s, and the regex-log control still passes with timing in family. But no trial crossed 200k, so the guard never actually fired — zero forced compactions, zero compaction events of any kind. This run proves the fix doesn't break anything; it does not prove it recovers a 400-bound trial.

  2. The two 400-victims improved by trajectory variance, not by the fix. Both ended with a voluntary clean stop at 113–165k — well below where either the guard or the declared window matters — in 8 min instead of 19–29 min, and video-processing didn't repeat its length-stop runaway. Same tasks, same model, same sampler: they simply drew shorter, luckier trajectories this time. This is the K=1 variance the full analysis flagged (H5), landing in the favorable direction. Do not book these as wins for H6.

  3. Both fails reproduce the "near-miss on a hidden criterion" pattern (§4). Both self-declared done; both passed hidden subtests. path-tracing-reverse is literally one assertion away — its reconstructed C program exists and compiles, but renders the wrong image (cosine similarity 0.5547 vs the required ≥0.995). video-processing confidently wrote jump_takeoff_frame_number = 48 / jump_land_frame_number = 72 to output.toml; both functional assertions disagree. These are capability misses, not harness misses.

  4. NEW texture — degenerate final turn on path-tracing-reverse. Its last assistant message cuts off mid-<think> (mid-sentence, analyzing xmm0 bit layout), and the entire visible text is the literal token user — a chat-template role-marker leak — followed by a clean stop. The agent didn't decide it was done; it fell out of the conversation while holding a compiling near-miss it never got to iterate on. Plausible mechanism: the server's reasoning-budget nudge force-closes </think> and the model resumes in the wrong template state. One occurrence = anecdote; if it recurs, it's a stop-token/template bug worth chasing (grep pi.txt for a final stop whose only text is user).

  5. Cliff-consistent, for what n=3 is worth: the one trial ≤51k passed, both

    100k failed. No new evidence, no contradiction.

GPU log — first run with gpu_samples.csv; the GPU is busy 91% of wall time

First run with the sampler active (15 s cadence, 96 samples over 23.8 min). Power is an unambiguous idle/busy classifier: every sample is either at the 27 W idle floor (9 samples) or in a ~289 W decode band (87 samples) — nothing in between, exactly the bimodal split the sampler was designed around. (Temp alone would mislead: several "idle" samples still read 61–66 °C from the preceding burst; power is the signal.)

All 9 idle samples (~2.2 min, 9.4%) fall in per-trial overhead, none in agent execution. Mapping each sample to its trial phase: 8 land in environment_setup/agent_setup (the ~31–35 s pi install per trial) and the ~11 s inter-trial teardown gaps; the 1 "idle" sample inside an agent_execution window is 1 s after the phase started, before the first prefill. During actual agent execution the GPU never went idle — 0 idle in 84 in-execution samples.

So for LLM-bound tasks the harness adds ~45–50 s of GPU-off overhead per trial (≈1.2 h if extrapolated to 89 suite trials) and wastes nothing else. This neither confirms nor refutes the suite's estimated ~7.5 h of GPU-idle container waits — that idle pool is attributed to CPU-bound in-task work (train-fasttext, compile-compcert, crack-7z-hash at 1–2 enforced cores), which this smoke set doesn't exercise, and the 20260703 suite predates the sampler. The next long run's gpu_samples.csv will answer it directly; a cheap preview is one CPU-bound smoke task (e.g. crack-7z-hash) and counting sub-45 W samples inside its agent_execution window.

Next steps

  1. Actually validate the guard, cheaply: run make smoke SMOKE_TASKS=regex-log COMPACT_AT_TOKENS=40000 — regex-log peaks ~40–51k, so the hook must fire; confirm the forcing compaction log line + a compaction event in pi.txt + the task still passes. Then re-run the two unretested 400-victims (circuit-fibsqrt, make-mips-interpreter — both >257k in the suite) at the default 200k threshold.
  2. K=3 on path-tracing-reverse + video-processing to separate variance from any real effect before attributing anything to the window change (H5).
  3. Keep an eye out for the user role-leak ending (point 4) in future transcripts.
  4. To measure the suspected suite-scale GPU-idle pool: include one CPU-bound task (e.g. crack-7z-hash) in a smoke and count sub-45 W samples inside its agent_execution window — the 20260703 suite has no gpu_samples.csv, so the ~7.5 h idle estimate is still analytical, not measured.

Run details

modelllama-local/qwen3.6-35b-a3bagentharnesses.minimal_pi:MinimalPithinkingonreasoning budgetdirect (server/none)* — see journal for the authoritative mechanismmaxTokens / contextWindow32768 / 229376agent timeout ×2.0trials3 of 3 — 1 pass · 2 failmean reward0.33tokens (job total)7,560,039 in / 150,657 outstarted / finished2026-07-04T12:13 / 2026-07-04T12:37wall clock23m48s

Tasks

path-tracing-reverse — 0/1 passed

#resulttotalagentin/out tok
1FAIL8m57s8m01s4674900/45850

regex-log — 1/1 passed

#resulttotalagentin/out tok
1PASS5m33s4m37s702930/42962

video-processing — 0/1 passed

#resulttotalagentin/out tok
1FAIL9m17s8m20s2182209/61845