Author: automated analysis (Claude), 2026-07-04
Supersedes/extends: ANALYSIS-qwen3.6-35b-a3b-suite-partial.md (which scored 45
of 89 tasks). This doc analyses the 43 previously-pending tasks now that
runs/suite__qwen3.6-35b-a3b__20260703-003556 has finished (started
2026-07-03T00:35Z, finished 2026-07-04T00:09Z, ~23.5 h wall), re-checks the
partial doc's six hypotheses against all 89 trials, and adds new hypotheses that
only became visible with the full data.
Metrics were extracted programmatically from each trial's agent/pi.txt
(stopReasons, per-turn think/text/tool sizes, peak context, compactions) and
verifier/test-stdout.txt (hidden subtest pass/fail). Script: /tmp/extract_suite_metrics.py.
| Cohort | Pass | Fail | Infra-noreward | Pass rate (scored) |
|---|---|---|---|---|
| Partial (first 45, doc'd) | 19 | 25 | 1 | 43% |
| New (back 44) | 14 | 29 | 1 | 32% |
| Full 89 | 33 | 54 | 2 | 37.1% |
The suite mean reward is 0.371 (result.json), not the 0.42 partial estimate — the second half of the run was a harder draw. Reward is binary per task (all hidden subtests must pass), which matters a lot for the near-miss finding below.
Timing pattern from the partial doc holds: passes are cheap (mean 7.3 min, 241 min total) and 83% of wall-clock went to non-passing trials (mean 20.9 min).
| # | Hypothesis (partial doc) | Verdict on full data |
|---|---|---|
| A | Context overflow / compaction is the biggest mechanical loss | Confirmed & bigger than stated — see §2. New: a pass-rate cliff at ~50k, not just the 262k cap. |
| A′ | Hard 262k 400 exceeds context kills whole trials |
Confirmed, now 4 not 2 — added circuit-fibsqrt, make-mips-interpreter. |
| B | Single-turn runaway moved from thinking to plain text; DRY sampler + lower max_tokens will fix it |
Partly wrong — see §3. Only 1 of 8 runaways is a text-loop DRY would catch. 6–7 are oversized file writes hitting the 32k cap; lowering max_tokens would regress them, and DRY does nothing for them. |
| C | Timeouts are real on a 3090 | Confirmed & doubled — 10 AgentTimeoutError (was 4). New: 2 of them still PASSED (qemu-startup, sqlite-db-truncate) — a timeout is not an automatic loss. |
| D | "Confident false success" is the dominant scored failure; a verify-before-done preamble is the biggest win | Half-confirmed, mechanism is different — see §4. 65% of fails self-declare done, but half of those already ran a check — the problem is misreading/using-the-wrong-check and hidden criteria, not skipping verification. A naive preamble helps less than claimed. |
| E/F | Infra (2) + residual capability | Confirmed (mteb-retrieve, mteb-leaderboard = env-start; caffe-cifar-10, pytorch-model-recovery = non-zero exit). |
The reasoning-budget finding also holds: max <think> across all 89 trials is
~18k chars (≈4–4.5k tokens); zero reasoning runaways. --reasoning-budget 4000
is doing its job.
Peak request size vs pass rate over the 86 trials with a transcript:
| peak context | n | pass rate |
|---|---|---|
| < 50k | 48 | 60% |
| 50k – 131k | 16 | 6% (1/16) |
| 131k – 200k | 6 | 33% (small n) |
| 200k – 300k | 16 | 6% (1/16) |
compile-compcert (173 short compile-iterate turns, low output/turn). Every
other task that grew context into that band failed.A′ (hard 400) is worse than the partial doc knew: 4 trials died on
400 request (2621xx tokens) exceeds the available context size (262144) —
path-tracing-reverse, video-processing (known) plus circuit-fibsqrt,
make-mips-interpreter. These are pure harness losses; the §5.1 fix (catch the
400 → compact/retry, declare window ~230k) still applies and now recovers 4 trials.
8 trials hit a 32 000-token length stop with large output. Splitting each
runaway turn by its dominant content block:
| Trial | reward | runaway kind |
|---|---|---|
| polyglot-rust-c | 0 | B1 text-loop ("let me try a different approach…" ×N, 97k chars) |
| dna-assembly | 0 | 4× B2 oversized-write + 1× B1 (52k chars of -A-G-G-T repeat) |
| circuit-fibsqrt | 0 | 6× B2 oversized-write (regenerates a ~100k-char gate-gen script each turn) |
| db-wal-recovery | 0 | 1× B2 oversized-write |
| qemu-alpine-ssh | 0 | 1× B2 oversized-write |
| regex-chess | 0 | 2× B2 oversized-write |
| feal-linear-cryptanalysis | 1 | 1× B2 oversized-write (passed anyway) |
| sqlite-db-truncate | 1 | 4× B2 oversized-write (passed anyway) |
Only polyglot-rust-c is the pure text-loop the partial doc built its case on.
The rest are the model trying to emit a single tool call whose legitimate content
(a generated program, a large heredoc) is >32k tokens, getting truncated, and
retrying the whole thing. Consequences for the partial doc's recommendations:
max_tokens to 12–16k (partial §5.2 / NOTES #3) would REGRESS these
tasks — they need >32k in one write. circuit-fibsqrt's write was ~30k tokens,
right at the ceiling; a higher cap might let it complete, not a lower one.-A-G-G-T
degenerate stretch inside dna-assembly.So B splits: B1 (text repetition) → DRY; B2 (oversized single write) →
chunked-write scaffolding, and do not lower max_tokens blindly.
Across the full suite: 35 of 54 failing trials (65%) ended with the agent
voluntarily declaring done (clean stop, no crash/timeout). But the partial
doc's framing — "declared success without running the provided checks" — is only
half right: 17 of those 35 had already run a check-like command. The failure is
rarely "didn't verify"; it's one of:
test_outputs.py.Parsing hidden-subtest pass/fail for every failing trial: 28 of 54 fails passed at least one hidden subtest, and a striking cluster failed on exactly one:
| Trial | subtests | what the single failure was |
|---|---|---|
| build-cython-ext | 17 pass / 1 fail | one pre-existing lib incompat, unrelated to the task |
| financial-document-processor | 6 / 1 | one extraction rule |
| openssl-selfsigned-cert | 5 / 1 | its check_cert.py imports cryptography, absent in the verifier env |
| query-optimize | 5 / 1 | output correct, but 0.80× golden speed vs required ≥0.95× |
| cancel-async-tasks / llm-inference-batching-scheduler / large-scale-text-editing / mailman / sparql-university / winning-avg-corewars / train-fasttext | N/1 | single correctness/format assertion |
Implications:
- A verify-before-done preamble helps case 1 but NOT cases 2–3 — the agent
cannot see the hidden test, so "run the provided check" doesn't exist for many of
these. The realistic win from the preamble is smaller than "the biggest scored
win"; it mainly stops the fabrication cases (db-wal-recovery invented 6 DB
records; still the clearest D example).
- openssl-selfsigned-cert is a cautionary datapoint. It was one of the two
smoke-set tasks RUNS.md tuned the reasoning budget on, and it failed here —
not for any reason the budget/window/loop work addresses, but because the agent's
own verification script depends on a package the verifier container lacks. A
"make your check.py use only stdlib/CLI tools" instruction would recover it. Don't
read this failure as a regression in the tuned knobs.
- K=1 variance is real. With ~7 tasks one hidden assertion away and binary
scoring, a K=3 validity run would likely lift the headline pass rate by a few
points with no harness change — worth it before over-fitting the scaffold to
single-shot noise.
10 AgentTimeoutError (up from 4): crack-7z-hash, feal-differential-cryptanalysis,
gcode-to-text, gpt2-codegolf, make-mips-interpreter, model-extraction-relu-logits,
qemu-alpine-ssh, qemu-startup, sqlite-db-truncate, write-compressor.
qemu-startup, sqlite-db-truncate): the agent
produced the correct result before the wall clock killed it. AgentTimeoutError ≠
fail — this reinforces the existing decision to keep agent_timeout_multiplier=2.0
(see the "rejected speed-ups" note) rather than trimming it.crack-7z-hash at 1–2 CPU cores by
task design), context-squeeze deaths (gcode-to-text, make-mips-interpreter
both >257k ctx), and runaways (write-compressor 195k output tokens). Killing the
runaways early (via the context health signal in §2 + B2 write scaffolding) is the
lever, not changing the multiplier.Ordered by expected ROI / confidence.
H1 — Context growth is the master failure variable; add an early-abort / self-summarize trigger at ~50–60k. Below 50k → 60% pass; 50–131k → 6%. A harness hook that, on crossing ~50k, forces the agent to write a durable progress file and restart from a compacted, task-anchored summary would attack the biggest bucket. Test: instrument the smoke set; confirm the cliff reproduces, then A/B the trigger.
H2 — B2 oversized writes need chunked/programmatic writing, and max_tokens
should stay ≥32k (maybe rise). Lowering it regresses circuit-fibsqrt,
dna-assembly, regex-chess, qemu-alpine-ssh. Test: on circuit-fibsqrt,
try (a) max_tokens=48k and (b) a preamble "generate large files with a script,
never paste >500 lines in one tool call"; compare completion.
H3 — DRY sampler is narrowly useful (B1 only). Keep the partial doc's DRY
experiment but expect it to fix polyglot-rust-c and the -A-G-G-T stretch, not
the runaway count broadly. Test: A/B DRY on polyglot-rust-c (should drop its
length-stop) vs circuit-fibsqrt (should NOT change).
H4 — The verify-before-done preamble's real yield is the fabrication subset +
"make your own check.py dependency-free," not "run the hidden test." Test:
preamble on db-wal-recovery (fabrication) and openssl-selfsigned-cert
(self-check dependency) — both should flip; expect little movement on
query-optimize (a real perf miss) or sparql-university (a correctness miss).
H5 — K=1 binary scoring understates the model; run K=3 before more tuning. ~7 tasks are one hidden assertion from passing. Test: K=3 on those 7 + a random 10; measure how many flip with zero code change.
H6 — The 262k hard-400 is a pure harness bug worth 4 trials. Highest-confidence
mechanical fix, unchanged from partial §5.1: declare window ~230k and catch
the 400 → compact/retry. Recovers path-tracing-reverse, video-processing,
circuit-fibsqrt, make-mips-interpreter.
max_tokens≥32k (H2) — do NOT lower the
cap; that was the one partial-doc recommendation the full data contradicts.Smoke set to exercise every mode: regex-log (baseline pass), polyglot-rust-c
(B1 loop), circuit-fibsqrt (B2 oversized-write + 262k 400), path-tracing
(context squeeze), db-wal-recovery (fabrication), openssl-selfsigned-cert
(dependency-in-own-check near-miss).