Full-suite analysis — qwen3.6-35b-a3b MoE on terminal-bench 2.0 (89/89 complete)

Author: automated analysis (Claude), 2026-07-04
Supersedes/extends: ANALYSIS-qwen3.6-35b-a3b-suite-partial.md (which scored 45 of 89 tasks).

This doc analyses the 44 previously pending tasks (43 scored, 1 infra no-reward) now that runs/suite__qwen3.6-35b-a3b__20260703-003556 has finished (started 2026-07-03T00:35Z, finished 2026-07-04T00:09Z, ~23.5 h wall), re-checks the partial doc's six hypotheses against all 89 trials, and adds new hypotheses that only became visible with the full data.

Metrics were extracted programmatically from each trial's agent/pi.txt (stopReasons, per-turn think/text/tool sizes, peak context, compactions) and verifier/test-stdout.txt (hidden subtest pass/fail). Script: /tmp/extract_suite_metrics.py.

0. Headline: the full suite is harder than the partial implied

Cohort	Pass	Fail	Infra-noreward	Pass rate (all trials)
Partial (first 45, doc'd)	19	25	1	42%
New (back 44)	14	29	1	32%
Full 89	33	54	2	37.1%

The suite mean reward is 0.371 (result.json), which is 33/89 with the two infra no-reward trials counted as zero, not the 0.42 partial estimate (19/45): the second half of the run was a harder draw. Reward is binary per task (all hidden subtests must pass), which matters a lot for the near-miss finding below.

Timing pattern from the partial doc holds: passes are cheap (mean 7.3 min, 241 min total) and 83% of wall-clock went to non-passing trials (mean 20.9 min).
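The wall-clock split can be re-derived from the counts and means quoted above. The sketch below assumes the 20.9 min mean covers all 56 non-passing trials (54 fails plus 2 infra no-reward) and that trials ran back to back; the total it produces is consistent with the ~23.5 h wall time.

```python
# Counts and means are taken from the table and text above.
n_pass, n_total = 33, 89
mean_pass_min, mean_nonpass_min = 7.3, 20.9

# Assumption: "non-passing" means the 54 fails plus the 2 infra no-reward trials.
n_nonpass = n_total - n_pass

pass_minutes = n_pass * mean_pass_min
nonpass_minutes = n_nonpass * mean_nonpass_min
total_minutes = pass_minutes + nonpass_minutes
nonpass_share = nonpass_minutes / total_minutes

print(f"passes: {pass_minutes:.0f} min, non-passes: {nonpass_minutes:.0f} min")
print(f"total: {total_minutes / 60:.1f} h, non-pass share: {nonpass_share:.0%}")
```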

1. Verdict on the partial doc's six hypotheses

#	Hypothesis (partial doc)	Verdict on full data
A	Context overflow / compaction is the biggest mechanical loss	Confirmed & bigger than stated — see §2. New: a pass-rate cliff at ~50k, not just the 262k cap.
A′	Hard 262k `400 exceeds context` kills whole trials	Confirmed, now 4 not 2 — added `circuit-fibsqrt`, `make-mips-interpreter`.
B	Single-turn runaway moved from thinking to plain text; DRY sampler + lower `max_tokens` will fix it	Partly wrong — see §3. Only 1 of 8 runaways is a text-loop DRY would catch. 6–7 are oversized file writes hitting the 32k cap; lowering `max_tokens` would regress them, and DRY does nothing for them.
C	Timeouts are real on a 3090	Confirmed & more than doubled: 10 `AgentTimeoutError` (was 4). New: 2 of them still PASSED (`qemu-startup`, `sqlite-db-truncate`), so a timeout is not an automatic loss.
D	"Confident false success" is the dominant scored failure; a verify-before-done preamble is the biggest win	Half-confirmed, mechanism is different — see §4. 65% of fails self-declare done, but half of those already ran a check — the problem is misreading/using-the-wrong-check and hidden criteria, not skipping verification. A naive preamble helps less than claimed.
E/F	Infra (2) + residual capability	Confirmed (mteb-retrieve, mteb-leaderboard = env-start; caffe-cifar-10, pytorch-model-recovery = non-zero exit).

The reasoning-budget finding also holds: max <think> across all 89 trials is ~18k chars (≈4–4.5k tokens); zero reasoning runaways. --reasoning-budget 4000 is doing its job.

2. Context: the real story is a pass-rate cliff at ~50k (NEW)

Peak request size vs pass rate over the 86 trials with a transcript:

peak context	n	pass rate
< 50k	48	60%
50k – 131k	16	6% (1/16)
131k – 200k	6	33% (small n)
200k – 300k	16	6% (1/16)

26% of trials (22/86) blew past the harness's own declared 131 072 window; 8 reached >250k and 0 of those 8 passed.
The single most predictive failure signal is not hitting the 262k cap — it's crossing ~50k context at all. The only pass in the entire 50–131k band was compile-compcert (173 short compile-iterate turns, low output/turn). Every other task that grew context into that band failed.
This is partly circular (hard tasks take more turns → more context), but the sharpness of the cliff makes it a usable early-abort / health signal: a trial north of ~60k context and still not done is almost certainly in a death spiral.
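The health signal could be sketched roughly as follows. This is an illustration rather than harness code: `TrialState`, `context_band`, and `in_death_spiral` are hypothetical names, the band edges are copied from the table above, and the 60k default is the approximate figure from the text, which would need tuning on the smoke set.

```python
from dataclasses import dataclass

# Band edges in tokens, copied from the table above.
BANDS = [
    (0, 50_000, "< 50k"),
    (50_000, 131_072, "50k-131k"),
    (131_072, 200_000, "131k-200k"),
    (200_000, 300_000, "200k-300k"),
]


def context_band(peak_tokens: int) -> str:
    """Map a trial's peak request size to its band label."""
    for lo, hi, label in BANDS:
        if lo <= peak_tokens < hi:
            return label
    return ">= 300k"


@dataclass
class TrialState:
    # Hypothetical per-trial snapshot; field names are illustrative.
    peak_context_tokens: int
    declared_done: bool


def in_death_spiral(state: TrialState, threshold: int = 60_000) -> bool:
    """Flag a trial that is past the threshold and still not done."""
    return state.peak_context_tokens > threshold and not state.declared_done
```

A harness could poll `in_death_spiral` after each turn and use it either to abort or to trigger a forced summary, depending on which intervention is being tested.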

A′ (hard 400) is worse than the partial doc knew: 4 trials died on `400 request (2621xx tokens) exceeds the available context size (262144)`. These are path-tracing-reverse and video-processing (known) plus circuit-fibsqrt and make-mips-interpreter. They are pure harness losses; the fix from the partial doc's §5.1 (catch the 400 → compact/retry, declare window ~230k) still applies and now recovers 4 trials.
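A minimal sketch of the catch-and-compact idea, under stated assumptions: `ContextOverflowError` stands in for however the client surfaces the HTTP 400, and `send`, `compact`, and `count_tokens` are hypothetical callables supplied by the harness. None of these names come from the actual harness.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ContextOverflowError(Exception):
    """Hypothetical stand-in for the 400 'exceeds the available context size'."""


def send_with_compaction(
    messages: List[Message],
    send: Callable[[List[Message]], str],
    compact: Callable[[List[Message]], List[Message]],
    count_tokens: Callable[[List[Message]], int],
    declared_window: int = 230_000,  # the ~230k figure from the text
    max_retries: int = 2,
) -> str:
    """Send a request, compacting the history instead of dying on overflow."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    # Proactive path: compact before sending when over the declared window.
    if count_tokens(messages) > declared_window:
        messages = compact(messages)
    for attempt in range(max_retries + 1):
        try:
            return send(messages)
        except ContextOverflowError:
            # Reactive path: the server rejected the request; shrink and retry.
            if attempt == max_retries:
                raise
            messages = compact(messages)
    raise RuntimeError("unreachable")
```

Declaring a window below the server's hard limit leaves headroom for the response, while the retry loop covers cases where the token estimate is off.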

3. Runaway generation is mostly OVERSIZED WRITES, not text loops (NEW — corrects hypothesis B)

8 trials hit a 32 000-token length stop with large output. Splitting each runaway turn by its dominant content block:

Trial	reward	runaway kind
polyglot-rust-c	0	B1 text-loop ("let me try a different approach…" ×N, 97k chars)
dna-assembly	0	4× B2 oversized-write + 1× B1 (52k chars of `-A-G-G-T` repeat)
circuit-fibsqrt	0	6× B2 oversized-write (regenerates a ~100k-char gate-gen script each turn)
db-wal-recovery	0	1× B2 oversized-write
qemu-alpine-ssh	0	1× B2 oversized-write
regex-chess	0	2× B2 oversized-write
feal-linear-cryptanalysis	1	1× B2 oversized-write (passed anyway)
sqlite-db-truncate	1	4× B2 oversized-write (passed anyway)

Only polyglot-rust-c is the pure text-loop the partial doc built its case on. The rest are the model trying to emit a single tool call whose legitimate content (a generated program, a large heredoc) is >32k tokens, getting truncated, and retrying the whole thing. Consequences for the partial doc's recommendations:

Lowering max_tokens to 12–16k (partial §5.2 / NOTES #3) would REGRESS these tasks: each needs on the order of 32k tokens or more in one write. circuit-fibsqrt's write was ~30k tokens, right at the ceiling; a higher cap might let it complete, whereas a lower one cannot.
The DRY sampler (partial §6.1) won't touch B2 — the output isn't repetitive, it's just long. DRY remains correct for the one B1 case and for the -A-G-G-T degenerate stretch inside dna-assembly.
Right fix for B2: teach the agent (via preamble) to write large artifacts programmatically (a small generator script, or chunked appends across turns) instead of pasting a 100k-char literal it then re-pastes on every retry. This is a scaffold/prompt change, orthogonal to sampler and token cap.

So B splits: B1 (text repetition) → DRY; B2 (oversized single write) → chunked-write scaffolding, and do not lower max_tokens blindly.
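The chunked-append idea behind the B2 fix can be illustrated as below. This is a sketch only: `chunk_lines` and `write_in_chunks` are hypothetical helpers, the 500-line limit is the figure from the proposed preamble, and in a real agent loop each chunk would be its own tool call rather than an iteration of a local loop.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def chunk_lines(lines: Iterable[str], max_lines: int = 500) -> Iterator[List[str]]:
    """Yield successive groups of at most max_lines lines."""
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == max_lines:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def write_in_chunks(path: Path, lines: Iterable[str], max_lines: int = 500) -> int:
    """Write lines to path in bounded appends; return the number of appends."""
    path.write_text("")  # start from an empty file
    n_chunks = 0
    for chunk in chunk_lines(lines, max_lines):
        with path.open("a") as fh:
            fh.write("".join(f"{line}\n" for line in chunk))
        n_chunks += 1
    return n_chunks
```

Because the content comes from a generator rather than a pasted literal, a truncated turn loses at most one chunk instead of forcing the whole artifact to be re-emitted.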

4. "Confident false success" is really "near-miss on a hidden criterion" (NEW — refines hypothesis D)

Across the full suite: 35 of 54 failing trials (65%) ended with the agent voluntarily declaring done (clean stop, no crash/timeout). But the partial doc's framing — "declared success without running the provided checks" — is only half right: 17 of those 35 had already run a check-like command. The failure is rarely "didn't verify"; it's one of:

Ran a check, but not the acceptance check. The agent self-tests against its own understanding, which differs from the hidden test_outputs.py.
Genuine near-miss on a single hidden subtest. Reward is binary, so these score 0 despite being one assertion away.
Spec misread — did the task, but not the exact form required.

Parsing hidden-subtest pass/fail for every failing trial: 28 of 54 fails passed at least one hidden subtest, and a striking cluster failed on exactly one:

Trial	subtests	what the single failure was
build-cython-ext	17 pass / 1 fail	one pre-existing lib incompat, unrelated to the task
financial-document-processor	6 / 1	one extraction rule
openssl-selfsigned-cert	5 / 1	its `check_cert.py` imports `cryptography`, absent in the verifier env
query-optimize	5 / 1	output correct, but 0.80× golden speed vs required ≥0.95×
cancel-async-tasks / llm-inference-batching-scheduler / large-scale-text-editing / mailman / sparql-university / winning-avg-corewars / train-fasttext	N/1	single correctness/format assertion
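Single-failure trials like those in the table can be picked out mechanically. The sketch below assumes verifier/test-stdout.txt contains pytest short-summary lines that begin with `PASSED` or `FAILED` followed by the test id; that format is an assumption about the verifier output, and `count_subtests` and `is_single_failure` are hypothetical names, not the functions in extract_suite_metrics.py.

```python
import re
from typing import Dict, Tuple

# Assumed line shapes (pytest short test summary):
#   PASSED ../tests/test_outputs.py::test_name
#   FAILED ../tests/test_outputs.py::test_name - AssertionError: ...
SUMMARY_LINE = re.compile(r"^(PASSED|FAILED)\s+(\S+)", re.MULTILINE)


def count_subtests(stdout: str) -> Tuple[int, int]:
    """Return (passed, failed), counting each test id once."""
    outcome: Dict[str, str] = {}
    for status, test_id in SUMMARY_LINE.findall(stdout):
        outcome[test_id] = status  # last reported status wins
    passed = sum(1 for s in outcome.values() if s == "PASSED")
    failed = sum(1 for s in outcome.values() if s == "FAILED")
    return passed, failed


def is_single_failure(stdout: str) -> bool:
    """True when exactly one subtest failed and at least one passed."""
    passed, failed = count_subtests(stdout)
    return failed == 1 and passed >= 1
```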

Implications:

A verify-before-done preamble helps case 1 but NOT cases 2–3: the agent cannot see the hidden test, so "run the provided check" doesn't exist for many of these. The realistic win from the preamble is smaller than "the biggest scored win"; it mainly stops the fabrication cases (db-wal-recovery invented 6 DB records; still the clearest D example).
openssl-selfsigned-cert is a cautionary datapoint. It was one of the two smoke-set tasks RUNS.md tuned the reasoning budget on, and it failed here, not for any reason the budget/window/loop work addresses, but because the agent's own verification script depends on a package the verifier container lacks. A "make your check.py use only stdlib/CLI tools" instruction would recover it. Don't read this failure as a regression in the tuned knobs.
K=1 variance is real. With ~7 tasks one hidden assertion away and binary scoring, a K=3 validity run would likely lift the headline pass rate by a few points with no harness change. That is worth doing before over-fitting the scaffold to single-shot noise.
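One way to read a K=3 run is to report the mean reward, the any-pass rate, and the list of tasks whose outcome changed between attempts. The text does not say how K=3 would be scored, so the sketch below reports both aggregates; `summarize_k_runs` is a hypothetical helper and the example rewards are made up for illustration, not measured.

```python
from typing import Dict, List


def summarize_k_runs(rewards: Dict[str, List[int]]) -> dict:
    """Summarize binary rewards from K repeated runs per task."""
    n = len(rewards)
    # Mean of per-task mean rewards (comparable to the K=1 suite mean).
    mean_reward = sum(sum(r) / len(r) for r in rewards.values()) / n
    # Fraction of tasks that passed in at least one of the K runs.
    any_pass_rate = sum(1 for r in rewards.values() if any(r)) / n
    # Tasks with mixed outcomes: these are the ones K=1 scoring can flip.
    unstable = sorted(t for t, r in rewards.items() if 0 < sum(r) < len(r))
    return {
        "mean_reward": mean_reward,
        "any_pass_rate": any_pass_rate,
        "unstable_tasks": unstable,
    }


# Made-up rewards for three placeholder tasks, K=3 each (NOT measured data).
example = {"task-a": [0, 0, 0], "task-b": [0, 1, 1], "task-c": [1, 1, 1]}
summary = summarize_k_runs(example)
```

The `unstable_tasks` list is the quantity of interest here: tasks that land in it are sensitive to single-shot noise rather than to any scaffold change.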

5. Timeouts (refines hypothesis C)

10 AgentTimeoutError (up from 4): crack-7z-hash, feal-differential-cryptanalysis, gcode-to-text, gpt2-codegolf, make-mips-interpreter, model-extraction-relu-logits, qemu-alpine-ssh, qemu-startup, sqlite-db-truncate, write-compressor.

2 passed despite timing out (qemu-startup, sqlite-db-truncate): the agent produced the correct result before the wall clock killed it. AgentTimeoutError ≠ fail — this reinforces the existing decision to keep agent_timeout_multiplier=2.0 (see the "rejected speed-ups" note) rather than trimming it.
The other 8 are a mix of genuine long compute (crack-7z-hash at 1–2 CPU cores by task design), context-squeeze deaths (gcode-to-text, make-mips-interpreter both >257k ctx), and runaways (write-compressor 195k output tokens). Killing the runaways early (via the context health signal in §2 + B2 write scaffolding) is the lever, not changing the multiplier.

6. New hypotheses (to test on the smoke set)

Ordered by expected ROI / confidence.

H1 — Context growth is the master failure variable; add an early-abort / self-summarize trigger at ~50–60k. Below 50k → 60% pass; 50–131k → 6%. A harness hook that, on crossing ~50k, forces the agent to write a durable progress file and restart from a compacted, task-anchored summary would attack the biggest bucket. Test: instrument the smoke set; confirm the cliff reproduces, then A/B the trigger.
H2 — B2 oversized writes need chunked/programmatic writing, and max_tokens should stay ≥32k (maybe rise). Lowering it regresses circuit-fibsqrt, dna-assembly, regex-chess, qemu-alpine-ssh. Test: on circuit-fibsqrt, try (a) max_tokens=48k and (b) a preamble "generate large files with a script, never paste >500 lines in one tool call"; compare completion.
H3 — DRY sampler is narrowly useful (B1 only). Keep the partial doc's DRY experiment but expect it to fix polyglot-rust-c and the -A-G-G-T stretch, not the runaway count broadly. Test: A/B DRY on polyglot-rust-c (should drop its length-stop) vs circuit-fibsqrt (should NOT change).
H4 — The verify-before-done preamble's real yield is the fabrication subset + "make your own check.py dependency-free," not "run the hidden test." Test: preamble on db-wal-recovery (fabrication) and openssl-selfsigned-cert (self-check dependency) — both should flip; expect little movement on query-optimize (a real perf miss) or sparql-university (a correctness miss).
H5 — K=1 binary scoring understates the model; run K=3 before more tuning. ~7 tasks are one hidden assertion from passing. Test: K=3 on those 7 + a random 10; measure how many flip with zero code change.
H6 — The 262k hard-400 is a pure harness bug worth 4 trials. Highest-confidence mechanical fix, unchanged from partial §5.1: declare window ~230k and catch the 400 → compact/retry. Recovers path-tracing-reverse, video-processing, circuit-fibsqrt, make-mips-interpreter.

7. Suggested experiment order (updated)

Harness 262k-400 catch + declared window ~230k (H6) — 4 dead trials, cheap, highest confidence.
Context health trigger at ~50–60k (H1) — attacks the biggest failure bucket; needs the instrumentation the partial doc §5.5 already asked for.
Large-file write scaffolding + keep max_tokens≥32k (H2) — do NOT lower the cap; that was the one partial-doc recommendation the full data contradicts.
DRY sampler, scoped to B1 (H3) and verify/self-check preamble, scoped to fabrication + dependency-free checks (H4) — run together on the smoke set.
K=3 validity run (H5) before further scaffold tuning, to separate signal from single-shot noise.

Smoke set to exercise every mode: regex-log (baseline pass), polyglot-rust-c (B1 loop), circuit-fibsqrt (B2 oversized-write + 262k 400), path-tracing (context squeeze), db-wal-recovery (fabrication), openssl-selfsigned-cert (dependency-in-own-check near-miss).