
Full-suite analysis — qwen3.6-35b-a3b MoE on terminal-bench 2.0 (89/89 complete)

Author: automated analysis (Claude), 2026-07-04
Supersedes/extends: ANALYSIS-qwen3.6-35b-a3b-suite-partial.md (which scored 45 of 89 tasks).

This doc analyses the 44 previously pending tasks (43 of them scored) now that runs/suite__qwen3.6-35b-a3b__20260703-003556 has finished (started 2026-07-03T00:35Z, finished 2026-07-04T00:09Z, ~23.5 h wall). It re-checks the partial doc's six hypotheses against all 89 trials and adds new hypotheses that only became visible with the full data.

Metrics were extracted programmatically from each trial's agent/pi.txt (stopReasons, per-turn think/text/tool sizes, peak context, compactions) and verifier/test-stdout.txt (hidden subtest pass/fail). Script: /tmp/extract_suite_metrics.py.


0. Headline: the full suite is harder than the partial implied

| Cohort | Pass | Fail | Infra-noreward | Pass rate (scored) |
|---|---|---|---|---|
| Partial (first 45, doc'd) | 19 | 25 | 1 | 43% |
| New (back 44) | 14 | 29 | 1 | 32% |
| Full 89 | 33 | 54 | 2 | 37.1% |

The suite mean reward is 0.371 (result.json), not the 0.42 partial estimate — the second half of the run was a harder draw. Reward is binary per task (all hidden subtests must pass), which matters a lot for the near-miss finding below.

Timing pattern from the partial doc holds: passes are cheap (mean 7.3 min, 241 min total) and 83% of wall-clock went to non-passing trials (mean 20.9 min).
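The headline figures can be cross-checked against each other. The sketch below uses only numbers quoted above, and it rests on two assumptions that the source does not state outright: that trials ran one at a time (so per-trial durations add up to wall-clock), and that the 20.9 min mean covers all 56 non-passing trials (54 fails plus 2 infra no-reward).

```python
# Cross-check of the headline numbers, using only figures quoted in this section.
PASSES = 33
FAILS = 54
INFRA_NO_REWARD = 2
MEAN_PASS_MIN = 7.3       # mean duration of a passing trial, in minutes
MEAN_NONPASS_MIN = 20.9   # mean duration of a non-passing trial, in minutes


def headline_check():
    """Recompute suite-level figures from the per-cohort counts and means."""
    total_trials = PASSES + FAILS + INFRA_NO_REWARD
    non_passing = FAILS + INFRA_NO_REWARD          # assumption: infra counts here
    pass_minutes = PASSES * MEAN_PASS_MIN
    nonpass_minutes = non_passing * MEAN_NONPASS_MIN
    total_minutes = pass_minutes + nonpass_minutes  # assumption: sequential trials
    return {
        "total_trials": total_trials,
        "mean_reward": PASSES / total_trials,
        "pass_minutes": pass_minutes,
        "nonpass_share": nonpass_minutes / total_minutes,
        "total_hours": total_minutes / 60,
    }


if __name__ == "__main__":
    for key, value in headline_check().items():
        print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

Under those assumptions the figures agree with one another: 33/89 gives the 0.371 mean reward, passes sum to about 241 min, non-passing trials take about 83% of the time, and the total lands near the ~23.5 h wall-clock.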


1. Verdict on the partial doc's six hypotheses

| # | Hypothesis (partial doc) | Verdict on full data |
|---|---|---|
| A | Context overflow / compaction is the biggest mechanical loss | Confirmed & bigger than stated; see §2. New: a pass-rate cliff at ~50k, not just the 262k cap. |
| A′ | Hard 400 "exceeds context" at 262k kills whole trials | Confirmed, now 4 not 2: added circuit-fibsqrt, make-mips-interpreter. |
| B | Single-turn runaway moved from thinking to plain text; DRY sampler + lower max_tokens will fix it | Partly wrong; see §3. Only 1 of 8 runaways is a text-loop DRY would catch. 6–7 are oversized file writes hitting the 32k cap; lowering max_tokens would regress them, and DRY does nothing for them. |
| C | Timeouts are real on a 3090 | Confirmed & more than doubled: 10 AgentTimeoutError (was 4). New: 2 of them still PASSED (qemu-startup, sqlite-db-truncate), so a timeout is not an automatic loss. |
| D | "Confident false success" is the dominant scored failure; a verify-before-done preamble is the biggest win | Half-confirmed, mechanism is different; see §4. 65% of fails self-declare done, but half of those already ran a check. The problem is misreading or using the wrong check, plus hidden criteria, not skipping verification. A naive preamble helps less than claimed. |
| E/F | Infra (2) + residual capability | Confirmed (mteb-retrieve, mteb-leaderboard = env-start; caffe-cifar-10, pytorch-model-recovery = non-zero exit). |

The reasoning-budget finding also holds: max <think> across all 89 trials is ~18k chars (≈4–4.5k tokens); zero reasoning runaways. --reasoning-budget 4000 is doing its job.


2. Context: the real story is a pass-rate cliff at ~50k (NEW)

Peak request size vs pass rate over the 86 trials with a transcript:

| peak context | n | pass rate |
|---|---|---|
| < 50k | 48 | 60% |
| 50k – 131k | 16 | 6% (1/16) |
| 131k – 200k | 6 | 33% (small n) |
| 200k – 300k | 16 | 6% (1/16) |
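A bucketing like the one in the table can be sketched as follows. This is not the extraction script itself: the function name, the `(peak_context_tokens, reward)` input shape, and the choice to put a trial at exactly 50k into the second bucket are all assumptions made for illustration.

```python
from bisect import bisect_right

# Bucket edges in tokens, mirroring the table: <50k, 50k-131k, 131k-200k, 200k and up.
EDGES = [50_000, 131_000, 200_000]
LABELS = ["< 50k", "50k - 131k", "131k - 200k", "200k - 300k"]


def pass_rate_by_peak_context(trials):
    """Group trials by peak request size and compute the pass rate per bucket.

    trials: iterable of (peak_context_tokens, reward) pairs, reward in {0, 1}.
    Returns {label: (n, pass_rate)}; pass_rate is None for an empty bucket.
    """
    counts = {label: [0, 0] for label in LABELS}   # label -> [n, passes]
    for peak, reward in trials:
        # bisect_right puts a value equal to an edge into the higher bucket.
        label = LABELS[bisect_right(EDGES, peak)]
        counts[label][0] += 1
        counts[label][1] += reward
    return {
        label: (n, passes / n if n else None)
        for label, (n, passes) in counts.items()
    }
```

Feeding it the 86 per-trial pairs from the extraction output should reproduce the table, which would be a cheap regression check before instrumenting the smoke set.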

A′ (hard 400) is worse than the partial doc knew: 4 trials died on "400 request (2621xx tokens) exceeds the available context size (262144)": path-tracing-reverse and video-processing (already known), plus circuit-fibsqrt and make-mips-interpreter. These are pure harness losses; the partial doc's §5.1 fix (catch the 400 → compact/retry, declare window ~230k) still applies and now recovers 4 trials.
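The catch → compact/retry idea could look roughly like this. It is a minimal sketch, not the harness's actual code: `ContextOverflowError`, `send`, `compact`, and the halving schedule are hypothetical names and choices, and only the ~230k declared window and the 262144 limit come from the text.

```python
class ContextOverflowError(Exception):
    """Hypothetical: raised by the client when the server answers
    400 'exceeds the available context size'."""


DECLARED_WINDOW = 230_000  # tokens; deliberately below the server's 262144 limit


def send_with_overflow_recovery(send, compact, messages, max_retries=2):
    """Call send(messages); on a context overflow, compact and retry.

    send and compact are hypothetical harness callables:
      send(messages) -> response, may raise ContextOverflowError
      compact(messages, target_tokens) -> shorter message list
    """
    for attempt in range(max_retries + 1):
        try:
            return send(messages)
        except ContextOverflowError:
            if attempt == max_retries:
                raise  # give up: surface the error instead of looping forever
            # Shrink harder on each retry so a second overflow is less likely.
            target = DECLARED_WINDOW // (2 ** (attempt + 1))
            messages = compact(messages, target)
```

Declaring a window below the real limit leaves headroom for tokenizer mismatch between the harness's estimate and the server's count, which is presumably why the overflows report 2621xx tokens against a 262144 cap.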


3. Runaway generation is mostly OVERSIZED WRITES, not text loops (NEW — corrects hypothesis B)

8 trials hit a 32 000-token length stop with large output. Splitting each runaway turn by its dominant content block:

| Trial | reward | runaway kind |
|---|---|---|
| polyglot-rust-c | 0 | B1 text-loop ("let me try a different approach…" ×N, 97k chars) |
| dna-assembly | 0 | B2 oversized-write + 1× B1 (52k chars of -A-G-G-T repeat) |
| circuit-fibsqrt | 0 | B2 oversized-write (regenerates a ~100k-char gate-gen script each turn) |
| db-wal-recovery | 0 | 1× B2 oversized-write |
| qemu-alpine-ssh | 0 | 1× B2 oversized-write |
| regex-chess | 0 | 2× B2 oversized-write |
| feal-linear-cryptanalysis | 1 | 1× B2 oversized-write (passed anyway) |
| sqlite-db-truncate | 1 | 4× B2 oversized-write (passed anyway) |

Only polyglot-rust-c is the pure text-loop the partial doc built its case on. The rest are the model trying to emit a single tool call whose legitimate content (a generated program, a large heredoc) exceeds 32k tokens, getting truncated, and retrying the whole thing. Consequences for the partial doc's recommendations: lowering max_tokens would regress the oversized-write trials, and DRY does nothing for them.

So B splits: B1 (text repetition) → DRY; B2 (oversized single write) → chunked-write scaffolding, and do not lower max_tokens blindly.
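One way to picture chunked-write scaffolding: the file is built by a series of appends, each small enough that no single tool call approaches the length cap. The helper names and the 20,000-character default below are hypothetical; the budget would need tuning against the 32k-token limit.

```python
from pathlib import Path


def split_into_chunks(text, max_chars):
    """Split text into pieces of at most max_chars, preferring line boundaries."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # A single over-long line is hard-split so no chunk exceeds the cap.
        while len(line) > max_chars:
            if current:
                chunks.append("".join(current))
                current, size = [], 0
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Flush the pending chunk if this line would push it over the cap.
        if size + len(line) > max_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def write_in_chunks(path, text, max_chars=20_000):
    """Write text to path as a sequence of appends; returns the chunk count.

    In an agent setting each append would be its own tool call.
    """
    path = Path(path)
    path.write_text("")                      # start from an empty file
    chunks = split_into_chunks(text, max_chars)
    for chunk in chunks:
        with path.open("a") as fh:
            fh.write(chunk)
    return len(chunks)
```

For content that is itself generated (the gate-gen script in circuit-fibsqrt, for instance), having the agent write a short generator program instead is likely the better option; chunked appends are the fallback when the content really has to be emitted verbatim.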


4. "Confident false success" is really "near-miss on a hidden criterion" (NEW — refines hypothesis D)

Across the full suite: 35 of 54 failing trials (65%) ended with the agent voluntarily declaring done (clean stop, no crash/timeout). But the partial doc's framing — "declared success without running the provided checks" — is only half right: 17 of those 35 had already run a check-like command. The failure is rarely "didn't verify"; it's one of:

  1. Ran a check, but not the acceptance check. The agent self-tests against its own understanding, which differs from the hidden test_outputs.py.
  2. Genuine near-miss on a single hidden subtest. Reward is binary, so these score 0 despite being one assertion away.
  3. Spec misread — did the task, but not the exact form required.

Parsing hidden-subtest pass/fail for every failing trial: 28 of 54 fails passed at least one hidden subtest, and a striking cluster failed on exactly one:

| Trial | subtests | what the single failure was |
|---|---|---|
| build-cython-ext | 17 pass / 1 fail | one pre-existing lib incompat, unrelated to the task |
| financial-document-processor | 6 / 1 | one extraction rule |
| openssl-selfsigned-cert | 5 / 1 | its check_cert.py imports cryptography, absent in the verifier env |
| query-optimize | 5 / 1 | output correct, but 0.80× golden speed vs required ≥0.95× |
| cancel-async-tasks / llm-inference-batching-scheduler / large-scale-text-editing / mailman / sparql-university / winning-avg-corewars / train-fasttext | N / 1 | single correctness/format assertion |

Implications:

  - A verify-before-done preamble helps case 1 but NOT cases 2–3: the agent cannot see the hidden test, so "run the provided check" doesn't exist for many of these. The realistic win from the preamble is smaller than "the biggest scored win"; it mainly stops the fabrication cases (db-wal-recovery invented 6 DB records; still the clearest D example).
  - openssl-selfsigned-cert is a cautionary datapoint. It was one of the two smoke-set tasks RUNS.md tuned the reasoning budget on, and it failed here. The cause is nothing the budget/window/loop work addresses: the agent's own verification script depends on a package the verifier container lacks. A "make your check.py use only stdlib/CLI tools" instruction would recover it. Don't read this failure as a regression in the tuned knobs.
  - K=1 variance is real. With ~7 tasks one hidden assertion away and binary scoring, a K=3 validity run would likely lift the headline pass rate by a few points with no harness change. That is worth doing before over-fitting the scaffold to single-shot noise.
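The gap between binary reward and how close these trials came is easy to make concrete. The sketch below uses the pass/fail counts from the near-miss table; the function names are illustrative, and the fractional score is only a diagnostic here, not something the benchmark reports.

```python
def binary_reward(passed, failed):
    """Benchmark-style reward: 1 only if every hidden subtest passes."""
    return 1 if failed == 0 and passed > 0 else 0


def subtest_fraction(passed, failed):
    """Diagnostic only: share of hidden subtests that passed."""
    return passed / (passed + failed)


# (passed, failed) counts quoted in the near-miss table.
NEAR_MISSES = {
    "build-cython-ext": (17, 1),
    "financial-document-processor": (6, 1),
    "openssl-selfsigned-cert": (5, 1),
    "query-optimize": (5, 1),
}


def summarize(near_misses):
    """Return {trial: (binary_reward, rounded_subtest_fraction)}."""
    return {
        name: (binary_reward(p, f), round(subtest_fraction(p, f), 2))
        for name, (p, f) in near_misses.items()
    }


if __name__ == "__main__":
    for name, (reward, fraction) in summarize(NEAR_MISSES).items():
        print(f"{name}: reward={reward}, subtests passed={fraction:.0%}")
```

All four score 0 while passing 83% to 94% of their subtests, which is why a single flipped assertion moves the headline number.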


5. Timeouts (refines hypothesis C)

10 AgentTimeoutError (up from 4): crack-7z-hash, feal-differential-cryptanalysis, gcode-to-text, gpt2-codegolf, make-mips-interpreter, model-extraction-relu-logits, qemu-alpine-ssh, qemu-startup, sqlite-db-truncate, write-compressor.


6. New hypotheses (to test on the smoke set)

Ordered by expected ROI / confidence.

  1. H1 — Context growth is the master failure variable; add an early-abort / self-summarize trigger at ~50–60k. Below 50k → 60% pass; 50–131k → 6%. A harness hook that, on crossing ~50k, forces the agent to write a durable progress file and restart from a compacted, task-anchored summary would attack the biggest bucket. Test: instrument the smoke set; confirm the cliff reproduces, then A/B the trigger.

  2. H2 — B2 oversized writes need chunked/programmatic writing, and max_tokens should stay ≥32k (maybe rise). Lowering it regresses circuit-fibsqrt, dna-assembly, regex-chess, qemu-alpine-ssh. Test: on circuit-fibsqrt, try (a) max_tokens=48k and (b) a preamble "generate large files with a script, never paste >500 lines in one tool call"; compare completion.

  3. H3 — DRY sampler is narrowly useful (B1 only). Keep the partial doc's DRY experiment but expect it to fix polyglot-rust-c and the -A-G-G-T stretch, not the runaway count broadly. Test: A/B DRY on polyglot-rust-c (should drop its length-stop) vs circuit-fibsqrt (should NOT change).

  4. H4 — The verify-before-done preamble's real yield is the fabrication subset + "make your own check.py dependency-free," not "run the hidden test." Test: preamble on db-wal-recovery (fabrication) and openssl-selfsigned-cert (self-check dependency) — both should flip; expect little movement on query-optimize (a real perf miss) or sparql-university (a correctness miss).

  5. H5 — K=1 binary scoring understates the model; run K=3 before more tuning. ~7 tasks are one hidden assertion from passing. Test: K=3 on those 7 + a random 10; measure how many flip with zero code change.

  6. H6 — The 262k hard-400 is a pure harness bug worth 4 trials. Highest-confidence mechanical fix, unchanged from partial §5.1: declare window ~230k and catch the 400 → compact/retry. Recovers path-tracing-reverse, video-processing, circuit-fibsqrt, make-mips-interpreter.
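For H1, the trigger logic itself is small; the hard part is the quality of the summary it forces. A minimal sketch of the trigger, assuming the harness can see each request's token count before sending (the class name, the prompt wording, and the PROGRESS.md filename are all hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ContextHealthTrigger:
    """Fires once when a request's token count first reaches the threshold."""
    threshold_tokens: int = 50_000   # the ~50k cliff; exact value to be tuned
    fired: bool = False

    def should_checkpoint(self, request_tokens: int) -> bool:
        """Return True exactly once per arming, on the first crossing."""
        if self.fired or request_tokens < self.threshold_tokens:
            return False
        self.fired = True
        return True

    def reset(self) -> None:
        """Re-arm after the agent has restarted from the compacted summary."""
        self.fired = False


# Hypothetical wording; the file name and the instruction text are placeholders.
CHECKPOINT_PROMPT = (
    "Context is getting large. Write your current progress, open problems, "
    "and next steps to PROGRESS.md, then stop and wait to be restarted."
)
```

A harness loop would call `should_checkpoint` before each request, inject `CHECKPOINT_PROMPT` when it returns True, then rebuild the conversation from the task statement plus the progress file and call `reset`.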


7. Suggested experiment order (updated)

  1. Harness 262k-400 catch + declared window ~230k (H6) — 4 dead trials, cheap, highest confidence.
  2. Context health trigger at ~50–60k (H1) — attacks the biggest failure bucket; needs the instrumentation the partial doc §5.5 already asked for.
  3. Large-file write scaffolding + keep max_tokens≥32k (H2) — do NOT lower the cap; that was the one partial-doc recommendation the full data contradicts.
  4. DRY sampler, scoped to B1 (H3) and verify/self-check preamble, scoped to fabrication + dependency-free checks (H4) — run together on the smoke set.
  5. K=3 validity run (H5) before further scaffold tuning, to separate signal from single-shot noise.
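For the K=3 run in step 5, the "how many flip" measurement could be computed as below. The source does not say how K runs are aggregated, so treating a task as passed if any of its K runs passes is an assumption; a per-task mean would be the other obvious choice. The function name and input shape are illustrative.

```python
def summarize_k_runs(rewards_by_task):
    """Compare first-run results with any-of-K results.

    rewards_by_task: task name -> list of K binary rewards, first run first.
    A task 'flips' if its first run scored 0 but a later run scored 1.
    """
    n = len(rewards_by_task)
    first_run_passes = sum(r[0] for r in rewards_by_task.values())
    any_run_passes = sum(1 for r in rewards_by_task.values() if any(r))
    flipped = sorted(
        name for name, r in rewards_by_task.items()
        if r[0] == 0 and any(r[1:])
    )
    return {
        "k1_pass_rate": first_run_passes / n,
        "any_of_k_pass_rate": any_run_passes / n,
        "flipped": flipped,
    }
```

The reverse direction matters too: tasks that passed on the first run but fail on later ones indicate that some of the current 33 passes are equally noisy, so the per-task reward lists are worth keeping alongside the summary.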

Smoke set to exercise every mode: regex-log (baseline pass), polyglot-rust-c (B1 loop), circuit-fibsqrt (B2 oversized-write + 262k 400), path-tracing (context squeeze), db-wal-recovery (fabrication), openssl-selfsigned-cert (dependency-in-own-check near-miss).