Full 89-task suite (complete) — 0.371 final; a ~50k context cliff dominates, runaways are oversized writes not text loops
What was run
The first full terminal-bench 2.0 suite (89 tasks × 1 attempt),
qwen3.6-35b-a3b, thinking ON, server-side --reasoning-budget 4000,
max_tokens=32768, context_window=131072, sequential (n_concurrent=1) on
the single 3090. Purpose: breadth — find the failure modes the 2-task smoke
set can't show. Wall clock ~23.5 h.
Results (final — all 89 scored)
Mean 0.371 (33 pass / 54 fail / 2 infra errors). The halfway estimate was
0.42; the back half was a harder draw (14/44). Two failure-mode reports with
per-trial evidence:
- ANALYSIS-qwen3.6-35b-a3b-suite-partial — first 45 tasks.
- ANALYSIS-qwen3.6-35b-a3b-suite-full — all 89, the authoritative one; it
re-checked the partial's hypotheses and corrected two of them.
Reproduce the per-trial metrics with make failures RUN=runs/suite__…003556
SORT=ctx (scripts/failure_modes.py).
Interpretation — confirmed, corrected, and new
Context overflow — confirmed and the master variable. NEW: a pass-rate
cliff at ~50k context (<50k → 60% pass; 50–131k → 6%; >250k → 0/8). 22 of
86 trials blew past the declared 131k window; the hard 400 exceeds context
size killed 4 trials (path-tracing-reverse, video-processing,
circuit-fibsqrt, make-mips-interpreter), not 2. Fix unchanged: declare window
~230k and catch the 400 → compact/retry.
Runaway generation — CORRECTED. Only 1 of 8 length-stops is a text
loop DRY would catch (polyglot-rust-c). 6–7 are oversized single file
writes hitting the 32k cap (100k-char generated programs). So do NOT
lower max_tokens — that regresses these; two of them even passed while
truncating (feal-linear, sqlite-db-truncate). Right fix: chunked/programmatic
large-file writing; keep max_tokens ≥32k. DRY stays scoped to text loops.
"Confident false success" — mechanism refined. 35 of 54 fails
self-declared done, but half already ran a check; the misses are on
hidden criteria the agent can't see (28/54 fails passed ≥1 hidden subtest;
12 failed exactly one). A verify-before-done preamble mainly helps the
fabrication subset (db-wal-recovery) and "make your check.py dependency-free"
(openssl-selfsigned-cert failed only because its check imports cryptography,
absent in the verifier env). Binary K=1 scoring understates the model.
Plus 10 wall-clock timeouts (not 4) — but 2 passed despite
AgentTimeoutError (qemu-startup, sqlite-db-truncate), so a timeout ≠ a loss;
keep the 2.0 multiplier. Reasoning-budget 4000 held (max <think> ~4–4.5k).
Next steps (ordered by expected ROI)
Harness: catch the 262k 400 → compact & retry, and declare window ~230k
(recovers 4 dead trials — highest confidence).
Harness: add a context health trigger at ~50–60k (self-summarize/restart) —
attacks the biggest failure bucket.
Harness: large-file chunked-write scaffolding; keep max_tokens ≥32k
(this reverses the partial doc's "lower to 12–16k" — the full data
contradicts it).
Server: DRY sampler scoped to text loops; Harness: verify/self-check preamble
scoped to fabrication + dependency-free checks. A/B on the smoke set.
Run K=3 on the ~12 one-assertion-away tasks before more scaffold tuning,
to separate signal from single-shot noise. Then the q8_0 KV test and
reasoning-budget 2000/4000/8000 sweep.