Full 89-task suite (complete) — 0.371 final; a ~50k context cliff dominates, runaways are oversized writes not text loops

What was run

The first full terminal-bench 2.0 suite (89 tasks × 1 attempt), qwen3.6-35b-a3b, thinking ON, server-side --reasoning-budget 4000, max_tokens=32768, context_window=131072, sequential (n_concurrent=1) on the single 3090. Purpose: breadth — find the failure modes the 2-task smoke set can't show. Wall clock ~23.5 h.

Results (final — all 89 scored)

Mean 0.371 (33 pass / 54 fail / 2 infra errors). The halfway estimate was 0.42; the back half was a harder draw (14/44). Two failure-mode reports with per-trial evidence: - ANALYSIS-qwen3.6-35b-a3b-suite-partial — first 45 tasks. - ANALYSIS-qwen3.6-35b-a3b-suite-full — all 89, the authoritative one; it re-checked the partial's hypotheses and corrected two of them.

Reproduce the per-trial metrics with make failures RUN=runs/suite__…003556 SORT=ctx (scripts/failure_modes.py).

Interpretation — confirmed, corrected, and new

Context overflow — confirmed and the master variable. NEW: a pass-rate cliff at ~50k context (<50k → 60% pass; 50–131k → 6%; >250k → 0/8). 22 of 86 trials blew past the declared 131k window; the hard 400 exceeds context size killed 4 trials (path-tracing-reverse, video-processing, circuit-fibsqrt, make-mips-interpreter), not 2. Fix unchanged: declare window ~230k and catch the 400 → compact/retry.
Runaway generation — CORRECTED. Only 1 of 8 length-stops is a text loop DRY would catch (polyglot-rust-c). 6–7 are oversized single file writes hitting the 32k cap (100k-char generated programs). So do NOT lower max_tokens — that regresses these; two of them even passed while truncating (feal-linear, sqlite-db-truncate). Right fix: chunked/programmatic large-file writing; keep max_tokens ≥32k. DRY stays scoped to text loops.
"Confident false success" — mechanism refined. 35 of 54 fails self-declared done, but half already ran a check; the misses are on hidden criteria the agent can't see (28/54 fails passed ≥1 hidden subtest; 12 failed exactly one). A verify-before-done preamble mainly helps the fabrication subset (db-wal-recovery) and "make your check.py dependency-free" (openssl-selfsigned-cert failed only because its check imports cryptography, absent in the verifier env). Binary K=1 scoring understates the model.

Plus 10 wall-clock timeouts (not 4) — but 2 passed despite AgentTimeoutError (qemu-startup, sqlite-db-truncate), so a timeout ≠ a loss; keep the 2.0 multiplier. Reasoning-budget 4000 held (max <think> ~4–4.5k).

Next steps (ordered by expected ROI)

Harness: catch the 262k 400 → compact & retry, and declare window ~230k (recovers 4 dead trials — highest confidence).
Harness: add a context health trigger at ~50–60k (self-summarize/restart) — attacks the biggest failure bucket.
Harness: large-file chunked-write scaffolding; keep max_tokens ≥32k (this reverses the partial doc's "lower to 12–16k" — the full data contradicts it).
Server: DRY sampler scoped to text loops; Harness: verify/self-check preamble scoped to fabrication + dependency-free checks. A/B on the smoke set.
Run K=3 on the ~12 one-assertion-away tasks before more scaffold tuning, to separate signal from single-shot noise. Then the q8_0 KV test and reasoning-budget 2000/4000/8000 sweep.

Errored trials: AgentTimeoutError (crack-7z-hash, feal-differential-cryptanalysis, gcode-to-text, gpt2-codegolf, make-mips-interpreter, model-extraction-relu-logits, qemu-alpine-ssh, qemu-startup, sqlite-db-truncate, write-compressor) · EnvironmentStartTimeoutError (mteb-leaderboard, mteb-retrieve) · NonZeroAgentExitCodeError (caffe-cifar-10, pytorch-model-recovery)

suite__qwen3.6-35b-a3b__20260703-003556

Full 89-task suite (complete) — 0.371 final; a ~50k context cliff dominates, runaways are oversized writes not text loops

What was run

Results (final — all 89 scored)

Interpretation — confirmed, corrected, and new

Next steps (ordered by expected ROI)

Run details

Tasks

adaptive-rejection-sampler — 0/1 passed

bn-fit-modify — 0/1 passed

break-filter-js-from-html — 0/1 passed

build-cython-ext — 0/1 passed

build-pmars — 1/1 passed

build-pov-ray — 0/1 passed

caffe-cifar-10 — 0/1 passed

cancel-async-tasks — 0/1 passed

chess-best-move — 0/1 passed

circuit-fibsqrt — 0/1 passed

cobol-modernization — 1/1 passed

code-from-image — 1/1 passed

compile-compcert — 1/1 passed

configure-git-webserver — 1/1 passed

constraints-scheduling — 1/1 passed

count-dataset-tokens — 1/1 passed

crack-7z-hash — 0/1 passed

custom-memory-heap-crash — 1/1 passed

db-wal-recovery — 0/1 passed

distribution-search — 1/1 passed

dna-assembly — 0/1 passed

dna-insert — 0/1 passed

extract-elf — 1/1 passed

extract-moves-from-video — 0/1 passed

feal-differential-cryptanalysis — 0/1 passed

feal-linear-cryptanalysis — 1/1 passed

filter-js-from-html — 0/1 passed

financial-document-processor — 0/1 passed

fix-code-vulnerability — 1/1 passed

fix-git — 1/1 passed

fix-ocaml-gc — 0/1 passed

gcode-to-text — 0/1 passed

git-leak-recovery — 1/1 passed

git-multibranch — 0/1 passed

gpt2-codegolf — 0/1 passed

headless-terminal — 0/1 passed

hf-model-inference — 1/1 passed

install-windows-3.11 — 0/1 passed

kv-store-grpc — 1/1 passed

large-scale-text-editing — 0/1 passed

largest-eigenval — 1/1 passed

llm-inference-batching-scheduler — 0/1 passed

log-summary-date-ranges — 1/1 passed

mailman — 0/1 passed

make-doom-for-mips — 0/1 passed

make-mips-interpreter — 0/1 passed

mcmc-sampling-stan — 1/1 passed

merge-diff-arc-agi-task — 1/1 passed

model-extraction-relu-logits — 0/1 passed

modernize-scientific-stack — 1/1 passed

mteb-leaderboard — 0/1 passed

mteb-retrieve — 0/1 passed

multi-source-data-merger — 1/1 passed

nginx-request-logging — 1/1 passed

openssl-selfsigned-cert — 0/1 passed

overfull-hbox — 0/1 passed

password-recovery — 0/1 passed

path-tracing — 0/1 passed

path-tracing-reverse — 0/1 passed

polyglot-c-py — 0/1 passed

polyglot-rust-c — 0/1 passed

portfolio-optimization — 1/1 passed

protein-assembly — 0/1 passed

prove-plus-comm — 1/1 passed

pypi-server — 1/1 passed

pytorch-model-cli — 1/1 passed

pytorch-model-recovery — 0/1 passed

qemu-alpine-ssh — 0/1 passed

qemu-startup — 1/1 passed

query-optimize — 0/1 passed

raman-fitting — 0/1 passed

regex-chess — 0/1 passed

suiteqwen3.6-35b-a3b20260703-003556