← older: smoke__qwen3.6-35b-a3b__20260702-213821all runsnewer: smoke__qwen3.6-35b-a3b__20260704-121319

suite__qwen3.6-35b-a3b__20260703-003556

Full 89-task suite (complete) — 0.371 final; a ~50k context cliff dominates, runaways are oversized writes not text loops

What was run

The first full terminal-bench 2.0 suite (89 tasks × 1 attempt), qwen3.6-35b-a3b, thinking ON, server-side --reasoning-budget 4000, max_tokens=32768, context_window=131072, sequential (n_concurrent=1) on the single 3090. Purpose: breadth — find the failure modes the 2-task smoke set can't show. Wall clock ~23.5 h.

Results (final — all 89 scored)

Mean 0.371 (33 pass / 54 fail / 2 infra errors). The halfway estimate was 0.42; the back half was a harder draw (14/44). Two failure-mode reports with per-trial evidence: - ANALYSIS-qwen3.6-35b-a3b-suite-partial — first 45 tasks. - ANALYSIS-qwen3.6-35b-a3b-suite-full — all 89, the authoritative one; it re-checked the partial's hypotheses and corrected two of them.

Reproduce the per-trial metrics with make failures RUN=runs/suite__…003556 SORT=ctx (scripts/failure_modes.py).

Interpretation — confirmed, corrected, and new

  1. Context overflow — confirmed and the master variable. NEW: a pass-rate cliff at ~50k context (<50k → 60% pass; 50–131k → 6%; >250k → 0/8). 22 of 86 trials blew past the declared 131k window; the hard 400 exceeds context size killed 4 trials (path-tracing-reverse, video-processing, circuit-fibsqrt, make-mips-interpreter), not 2. Fix unchanged: declare window ~230k and catch the 400 → compact/retry.
  2. Runaway generation — CORRECTED. Only 1 of 8 length-stops is a text loop DRY would catch (polyglot-rust-c). 6–7 are oversized single file writes hitting the 32k cap (100k-char generated programs). So do NOT lower max_tokens — that regresses these; two of them even passed while truncating (feal-linear, sqlite-db-truncate). Right fix: chunked/programmatic large-file writing; keep max_tokens ≥32k. DRY stays scoped to text loops.
  3. "Confident false success" — mechanism refined. 35 of 54 fails self-declared done, but half already ran a check; the misses are on hidden criteria the agent can't see (28/54 fails passed ≥1 hidden subtest; 12 failed exactly one). A verify-before-done preamble mainly helps the fabrication subset (db-wal-recovery) and "make your check.py dependency-free" (openssl-selfsigned-cert failed only because its check imports cryptography, absent in the verifier env). Binary K=1 scoring understates the model.

Plus 10 wall-clock timeouts (not 4) — but 2 passed despite AgentTimeoutError (qemu-startup, sqlite-db-truncate), so a timeout ≠ a loss; keep the 2.0 multiplier. Reasoning-budget 4000 held (max <think> ~4–4.5k).

Next steps (ordered by expected ROI)

  1. Harness: catch the 262k 400 → compact & retry, and declare window ~230k (recovers 4 dead trials — highest confidence).
  2. Harness: add a context health trigger at ~50–60k (self-summarize/restart) — attacks the biggest failure bucket.
  3. Harness: large-file chunked-write scaffolding; keep max_tokens ≥32k (this reverses the partial doc's "lower to 12–16k" — the full data contradicts it).
  4. Server: DRY sampler scoped to text loops; Harness: verify/self-check preamble scoped to fabrication + dependency-free checks. A/B on the smoke set.
  5. Run K=3 on the ~12 one-assertion-away tasks before more scaffold tuning, to separate signal from single-shot noise. Then the q8_0 KV test and reasoning-budget 2000/4000/8000 sweep.

Related analysis: Full-suite analysis — qwen3.6-35b-a3b MoE on terminal-bench 2.0 (89/89 complete) · Failure-mode analysis — qwen3.6-35b-a3b MoE on terminal-bench 2.0 (partial suite)

Run details

modelllama-local/qwen3.6-35b-a3bagentharnesses.minimal_pi:MinimalPithinkingonreasoning budgetdirect (server/none)* — see journal for the authoritative mechanismmaxTokens / contextWindow32768 / 131072agent timeout ×2.0trials89 of 89 — 33 pass · 44 fail · 14 erroredmean reward0.37tokens (job total)237,015,828 in / 5,040,500 outstarted / finished2026-07-03T00:35 / 2026-07-04T00:09wall clock1413m21s

Errored trials: AgentTimeoutError (crack-7z-hash, feal-differential-cryptanalysis, gcode-to-text, gpt2-codegolf, make-mips-interpreter, model-extraction-relu-logits, qemu-alpine-ssh, qemu-startup, sqlite-db-truncate, write-compressor) · EnvironmentStartTimeoutError (mteb-leaderboard, mteb-retrieve) · NonZeroAgentExitCodeError (caffe-cifar-10, pytorch-model-recovery)

Tasks

adaptive-rejection-sampler — 0/1 passed

#resulttotalagentin/out tok
1FAIL24m24s22m23s5651189/98204

bn-fit-modify — 0/1 passed

#resulttotalagentin/out tok
1FAIL3m31s2m04s370263/11833

break-filter-js-from-html — 0/1 passed

#resulttotalagentin/out tok
1FAIL8m47s7m22s573322/69178

build-cython-ext — 0/1 passed

#resulttotalagentin/out tok
1FAIL3m00s1m44s925483/6732

build-pmars — 1/1 passed

#resulttotalagentin/out tok
1PASS1m44s51s188282/4673

build-pov-ray — 0/1 passed

#resulttotalagentin/out tok
1FAIL43m41s42m26s37544875/242378

caffe-cifar-10 — 0/1 passed

#resulttotalagentin/out tok
1ERR16m37s15m24s909420/8700

cancel-async-tasks — 0/1 passed

#resulttotalagentin/out tok
1FAIL1m42s39s67889/5777

chess-best-move — 0/1 passed

#resulttotalagentin/out tok
1FAIL11m06s9m55s2846979/59392

circuit-fibsqrt — 0/1 passed

#resulttotalagentin/out tok
1FAIL67m12s65m46s1219520/256961

cobol-modernization — 1/1 passed

#resulttotalagentin/out tok
1PASS3m27s2m28s635107/23462

code-from-image — 1/1 passed

#resulttotalagentin/out tok
1PASS1m37s47s129487/4472

compile-compcert — 1/1 passed

#resulttotalagentin/out tok
1PASS57m29s56m05s6265266/21372

configure-git-webserver — 1/1 passed

#resulttotalagentin/out tok
1PASS2m04s45s100781/4170

constraints-scheduling — 1/1 passed

#resulttotalagentin/out tok
1PASS2m37s1m38s157858/16032

count-dataset-tokens — 1/1 passed

#resulttotalagentin/out tok
1PASS2m49s1m58s270700/7952

crack-7z-hash — 0/1 passed

#resulttotalagentin/out tok
1ERR31m27s30m00s185601/2857

custom-memory-heap-crash — 1/1 passed

#resulttotalagentin/out tok
1PASS5m22s2m23s488891/20993

db-wal-recovery — 0/1 passed

#resulttotalagentin/out tok
1FAIL23m54s22m44s2812282/144992

distribution-search — 1/1 passed

#resulttotalagentin/out tok
1PASS1m57s54s77800/9069

dna-assembly — 0/1 passed

#resulttotalagentin/out tok
1FAIL40m33s39m36s5245498/243639

dna-insert — 0/1 passed

#resulttotalagentin/out tok
1FAIL3m52s2m47s352138/27236

extract-elf — 1/1 passed

#resulttotalagentin/out tok
1PASS3m21s1m49s495165/16218

extract-moves-from-video — 0/1 passed

#resulttotalagentin/out tok
1FAIL54m09s53m10s7915626/146038

feal-differential-cryptanalysis — 0/1 passed

#resulttotalagentin/out tok
1ERR61m08s60m00s92348/29891

feal-linear-cryptanalysis — 1/1 passed

#resulttotalagentin/out tok
1PASS21m50s20m51s4396162/117480

filter-js-from-html — 0/1 passed

#resulttotalagentin/out tok
1FAIL5m47s1m38s69842/15993

financial-document-processor — 0/1 passed

#resulttotalagentin/out tok
1FAIL8m03s6m59s1273960/59884

fix-code-vulnerability — 1/1 passed

#resulttotalagentin/out tok
1PASS1m41s35s343158/3326

fix-git — 1/1 passed

#resulttotalagentin/out tok
1PASS1m00s13s31139/1806

fix-ocaml-gc — 0/1 passed

#resulttotalagentin/out tok
1FAIL11m25s7m52s6656260/43867

gcode-to-text — 0/1 passed

#resulttotalagentin/out tok
1ERR30m50s30m00s8432280/185215

git-leak-recovery — 1/1 passed

#resulttotalagentin/out tok
1PASS1m16s10s25182/1489

git-multibranch — 0/1 passed

#resulttotalagentin/out tok
1FAIL3m40s2m20s812065/18555

gpt2-codegolf — 0/1 passed

#resulttotalagentin/out tok
1ERR30m50s30m00s12455075/194731

headless-terminal — 0/1 passed

#resulttotalagentin/out tok
1FAIL6m06s5m15s1109667/32403

hf-model-inference — 1/1 passed

#resulttotalagentin/out tok
1PASS11m10s44s25209/1720

install-windows-3.11 — 0/1 passed

#resulttotalagentin/out tok
1FAIL27m17s25m09s1539947/34877

kv-store-grpc — 1/1 passed

#resulttotalagentin/out tok
1PASS1m01s14s36653/1516

large-scale-text-editing — 0/1 passed

#resulttotalagentin/out tok
1FAIL21m21s20m18s5453899/84094

largest-eigenval — 1/1 passed

#resulttotalagentin/out tok
1PASS4m17s3m27s1118882/27474

llm-inference-batching-scheduler — 0/1 passed

#resulttotalagentin/out tok
1FAIL30m20s29m34s3549016/142096

log-summary-date-ranges — 1/1 passed

#resulttotalagentin/out tok
1PASS1m04s13s39177/2029

mailman — 0/1 passed

#resulttotalagentin/out tok
1FAIL6m39s3m24s1807952/19036

make-doom-for-mips — 0/1 passed

#resulttotalagentin/out tok
1FAIL31m03s29m19s21484975/159152

make-mips-interpreter — 0/1 passed

#resulttotalagentin/out tok
1ERR62m16s60m00s6626970/144216

mcmc-sampling-stan — 1/1 passed

#resulttotalagentin/out tok
1PASS8m41s5m59s139000/4257

merge-diff-arc-agi-task — 1/1 passed

#resulttotalagentin/out tok
1PASS1m46s50s211884/7312

model-extraction-relu-logits — 0/1 passed

#resulttotalagentin/out tok
1ERR30m59s30m00s4638774/131796

modernize-scientific-stack — 1/1 passed

#resulttotalagentin/out tok
1PASS1m43s12s18890/1647

mteb-leaderboard — 0/1 passed

#resulttotalagentin/out tok
1ERR10m00s-

mteb-retrieve — 0/1 passed

#resulttotalagentin/out tok
1ERR10m00s-

multi-source-data-merger — 1/1 passed

#resulttotalagentin/out tok
1PASS1m41s24s22856/3998

nginx-request-logging — 1/1 passed

#resulttotalagentin/out tok
1PASS1m05s14s39727/1904

openssl-selfsigned-cert — 0/1 passed

#resulttotalagentin/out tok
1FAIL1m04s17s39332/2410

overfull-hbox — 0/1 passed

#resulttotalagentin/out tok
1FAIL6m07s4m49s1253207/43302

password-recovery — 0/1 passed

#resulttotalagentin/out tok
1FAIL13m13s12m11s2616825/81951

path-tracing — 0/1 passed

#resulttotalagentin/out tok
1FAIL37m26s35m48s8087763/175815

path-tracing-reverse — 0/1 passed

#resulttotalagentin/out tok
1FAIL30m22s29m09s9012373/77453

polyglot-c-py — 0/1 passed

#resulttotalagentin/out tok
1FAIL12m42s11m32s1423231/98200

polyglot-rust-c — 0/1 passed

#resulttotalagentin/out tok
1FAIL20m25s19m02s1561818/143399

portfolio-optimization — 1/1 passed

#resulttotalagentin/out tok
1PASS2m39s49s32322/2204

protein-assembly — 0/1 passed

#resulttotalagentin/out tok
1FAIL48m34s47m42s13906625/228690

prove-plus-comm — 1/1 passed

#resulttotalagentin/out tok
1PASS2m13s32s53116/5524

pypi-server — 1/1 passed

#resulttotalagentin/out tok
1PASS3m18s53s44507/1790

pytorch-model-cli — 1/1 passed

#resulttotalagentin/out tok
1PASS8m04s6m05s312240/11532

pytorch-model-recovery — 0/1 passed

#resulttotalagentin/out tok
1ERR14m44s0s0/0

qemu-alpine-ssh — 0/1 passed

#resulttotalagentin/out tok
1ERR32m07s30m00s2612422/80665

qemu-startup — 1/1 passed

#resulttotalagentin/out tok
1ERR32m48s30m00s6191867/118646

query-optimize — 0/1 passed

#resulttotalagentin/out tok
1FAIL10m11s3m06s31987/2516

raman-fitting — 0/1 passed

#resulttotalagentin/out tok
1FAIL8m06s7m18s989141/62065

regex-chess — 0/1 passed

#resulttotalagentin/out tok
1FAIL19m50s19m02s2086276/137677

regex-log — 1/1 passed

#resulttotalagentin/out tok
1PASS4m18s3m20s283649/31839

reshard-c4-data — 0/1 passed

#resulttotalagentin/out tok
1FAIL4m51s47s267756/5268

rstan-to-pystan — 1/1 passed

#resulttotalagentin/out tok
1PASS9m45s8m53s923212/16143

sam-cell-seg — 0/1 passed

#resulttotalagentin/out tok
1FAIL8m59s6m27s1175648/24489

sanitize-git-repo — 0/1 passed

#resulttotalagentin/out tok
1FAIL1m53s51s346964/4633

schemelike-metacircular-eval — 0/1 passed

#resulttotalagentin/out tok
1FAIL17m40s16m51s7700547/87138

sparql-university — 0/1 passed

#resulttotalagentin/out tok
1FAIL2m05s1m06s76663/11789

sqlite-db-truncate — 1/1 passed

#resulttotalagentin/out tok
1ERR30m59s30m00s2008802/182956

sqlite-with-gcov — 0/1 passed

#resulttotalagentin/out tok
1FAIL1m36s36s34830/1392

torch-pipeline-parallelism — 0/1 passed

#resulttotalagentin/out tok
1FAIL23m46s17m43s3146984/73974

torch-tensor-parallelism — 0/1 passed

#resulttotalagentin/out tok
1FAIL6m41s52s38290/8624

train-fasttext — 0/1 passed

#resulttotalagentin/out tok
1FAIL67m09s65m20s1097321/7000

tune-mjcf — 1/1 passed

#resulttotalagentin/out tok
1PASS4m34s3m26s329930/21000

video-processing — 0/1 passed

#resulttotalagentin/out tok
1FAIL20m47s19m24s5688976/103706

vulnerable-secret — 1/1 passed

#resulttotalagentin/out tok
1PASS1m26s30s131933/4300

winning-avg-corewars — 0/1 passed

#resulttotalagentin/out tok
1FAIL8m30s7m11s708806/63103

write-compressor — 0/1 passed

#resulttotalagentin/out tok
1ERR31m22s30m00s4920094/195213