Failure-mode analysis — qwen3.6-35b-a3b MoE on terminal-bench 2.0 (partial suite)

Author: automated analysis (Claude), 2026-07-03

Goal of this doc: find the failure modes in the in-progress suite run and turn them into concrete changes to (a) the pi harness and (b) the llama.cpp / llama-swap server config, so a local qwen agent performs closer to its ceiling with no external LLM providers.

1. What this is based on (provenance — read this first)

Run analysed: runs/suite__qwen3.6-35b-a3b__20260703-003556/ — the live, still-running full suite. result.json id 1f919902-…, started 2026-07-03T00:35, last updated 10:38Z, not finished.
Model: llama-local/qwen3.6-35b-a3b (Qwen3.6-35B-A3B MoE, ~3B active, UD-IQ4_XS), served by llama-swap on port 8020.
Harness: harnesses/minimal_pi:MinimalPi, pi 0.73.1. Per-trial kwargs: max_tokens=32768, context_window=131072, llama_port=8020, thinking=on. The run belongs to the server-flag --reasoning-budget 4000 era (see runs/RUNS.md), so reasoning is capped server-side.
Benchmark: terminal-bench@2.0, git commit 69671fbaac6d67a7ef0dfec016cc38a64ef7a77c, 89 tasks total, K=1 attempt each, n_concurrent=1, agent_timeout_multiplier=2.0.
Coverage at time of analysis: 45 tasks have a transcript (§3.1), 1 is still running, and 43 are pending (§3.2), which accounts for all 89. Numbers below are over the 45 tasks with a transcript. The earlier runs/suite__…235616/ run is a superseded stub (only 2 task dirs, abandoned); ignore it.

Do not re-analyse the 46 task directories already present in …003556/ when the run finishes — they are covered here. The tasks left to analyse are the 43 in §3.2.

2. Headline numbers (45 scored tasks)

Outcome	Count	Rate
Pass (reward 1.0)	19	42%
Fail (reward 0.0)	25	56%
Errored (no reward)	1	2%

The suite result.json reports mean reward 0.42 over the 44 non-infra trials. Aggregate cost of the partial run: 136.2M input tokens (98% cache-read), 2.6M output tokens.

Passing tasks are cheap and short (median ~10–30 turns, <35k input tokens). Failures split into a small number of mechanical modes (fixable in the harness/server) and a residual of genuine model-capability misses.

3. Task inventory

3.1 Completed in this run (do not re-analyse)

adaptive-rejection-sampler, break-filter-js-from-html, build-pov-ray, caffe-cifar-10, chess-best-move, cobol-modernization, compile-compcert, configure-git-webserver, crack-7z-hash, custom-memory-heap-crash, db-wal-recovery, distribution-search, feal-linear-cryptanalysis, fix-git, git-multibranch, gpt2-codegolf, headless-terminal, hf-model-inference, kv-store-grpc, largest-eigenval, llm-inference-batching-scheduler, log-summary-date-ranges, merge-diff-arc-agi-task, modernize-scientific-stack, mteb-retrieve, multi-source-data-merger, overfull-hbox, password-recovery, path-tracing-reverse, path-tracing, polyglot-rust-c, portfolio-optimization, prove-plus-comm, pypi-server, pytorch-model-cli, qemu-alpine-ssh, qemu-startup, regex-chess, reshard-c4-data, schemelike-metacircular-eval, torch-tensor-parallelism, train-fasttext, video-processing, winning-avg-corewars, write-compressor

3.2 Pending — analyse these after the full run (43)

bn-fit-modify, build-cython-ext, build-pmars, cancel-async-tasks, circuit-fibsqrt, code-from-image, constraints-scheduling, count-dataset-tokens, dna-assembly, dna-insert, extract-elf, extract-moves-from-video, feal-differential-cryptanalysis, filter-js-from-html, financial-document-processor, fix-code-vulnerability, fix-ocaml-gc, gcode-to-text, git-leak-recovery, large-scale-text-editing, mailman, make-doom-for-mips, make-mips-interpreter, mcmc-sampling-stan, model-extraction-relu-logits, mteb-leaderboard, nginx-request-logging, openssl-selfsigned-cert, polyglot-c-py, protein-assembly, pytorch-model-recovery, query-optimize, raman-fitting, regex-log, rstan-to-pystan, sam-cell-seg, sanitize-git-repo, sparql-university, sqlite-db-truncate, sqlite-with-gcov, torch-pipeline-parallelism, tune-mjcf, vulnerable-secret

Note: the two smoke-suite staples regex-log and openssl-selfsigned-cert (the tasks RUNS.md tuned the reasoning budget on) are still pending in the full suite — good, they'll give a clean full-suite read on the 4000 budget.

4. Failure-mode taxonomy (with evidence)

The 26 non-passing trials are categorised below. Categories overlap (a trial can hit several).

Mode	# trials	Fixable where
A. Context overflow forcing compaction	11	harness + server
A′. Hard 262k context 400-error mid-run (trial lost)	2	harness (highest ROI)
B. Single-turn runaway generation (32k `length` stop)	8	harness + server sampler
C. Agent timeout (ran out the 1800s wall clock)	4	server speed + harness
D. "Confident false success" (declared done, verifier failed)	~19	harness prompt/verify
E. Infra (env build timeout, agent non-zero exit)	2	infra, not model
F. Genuine capability miss	residual	model, not harness

A / A′ — Context management is the single biggest mechanical loss

The harness declares contextWindow: 131072, but the server's real window is -c 262144 (the model's full training context). Observed request sizes climbed to 259k–262k tokens, i.e. pi kept feeding context far past its own declared 131k window, and:

path-tracing-reverse and video-processing died on a hard 400 … (262608 tokens) exceeds the available context size (262144) — the turn errored (stopReason: error) and the whole trial was wasted. These two are pure harness losses, not model losses.
6 trials hit reactive compaction_start reason=overflow (pi only compacted after nearly overflowing); 6 hit proactive reason=threshold. The proactive threshold is clearly not anchored to the declared 131k — contexts reached 259k before compaction. Example regex-chess: input grew 2k → 140k with no proactive compaction; the only compaction_start fired at agent_end.

Why it matters: compaction is lossy — the model loses its working notes and often repeats work or declares premature success right after. And the two 400s are dead trials.
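The overflow comes down to simple arithmetic: a request fits only if the prompt plus the reply budget stays inside the server window. The sketch below is a minimal illustration of that rule, not pi's actual compaction logic; the function names and the 4096-token safety margin are assumptions introduced here, while the 262144, 32768 and 262608 figures come from this run.

```python
def safe_context_window(server_ctx: int, max_tokens: int, margin: int = 4096) -> int:
    """Largest declared window that still leaves room for one full reply.

    `margin` is an assumed safety buffer (tokenizer drift, tool schemas).
    """
    window = server_ctx - max_tokens - margin
    if window <= 0:
        raise ValueError("server context too small for max_tokens plus margin")
    return window


def would_overflow(prompt_tokens: int, server_ctx: int) -> bool:
    """True when the prompt alone already exceeds the server window."""
    return prompt_tokens > server_ctx


SERVER_CTX = 262_144  # server -c in this run
MAX_TOKENS = 32_768   # per-trial max_tokens in this run

print(safe_context_window(SERVER_CTX, MAX_TOKENS))  # 225280
print(would_overflow(262_608, SERVER_CTX))          # True
```

With the current max_tokens the result lands a little below 230k; a smaller max_tokens raises it accordingly.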

B — Single-turn runaway is now in plain text, not thinking

The server --reasoning-budget 4000 works: the largest <think> block across these trials is ~3.9–4.5k tokens (never near 32k). But 8 trials still hit the 32k-token length stop (max_tokens=32768), and the runaway is now in ordinary assistant text, which the reasoning budget doesn't touch. Clearest case polyglot-rust-c: a 97,344-char single text block looping "OK, let me try a completely different approach… Wait, I've tried this…" until it hit the 32k ceiling. build-pov-ray burned 296 turns / 242k output tokens doing this. The reasoning cap plugged one leak; the model found another.
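One way to flag this kind of text loop after the fact is to measure how many word n-grams in a message repeat an earlier n-gram. The sketch below is a rough post-hoc heuristic, not something pi or llama.cpp provides; the function names, the 8-word window and the 0.5 threshold are assumptions that would need tuning on real transcripts.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 8) -> float:
    """Fraction of word n-grams in `text` that duplicate an earlier n-gram."""
    words = text.split()
    total = len(words) - n + 1
    if total <= 0:
        return 0.0
    counts = Counter(tuple(words[i:i + n]) for i in range(total))
    # Every occurrence beyond the first of a given n-gram counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / total


def looks_like_loop(text: str, n: int = 8, threshold: float = 0.5) -> bool:
    """Crude flag: at least `threshold` of the n-grams are repeats."""
    return repeated_ngram_ratio(text, n) >= threshold


looping = "OK, let me try a completely different approach. Wait, I've tried this. " * 50
normal = " ".join(f"word{i}" for i in range(600))

print(looks_like_loop(looping))  # True
print(looks_like_loop(normal))   # False
```

A check like this could feed the per-turn metrics or trigger a nudge; it does not stop generation by itself.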

C — Timeouts are real on a 3090

4 tasks (crack-7z-hash, gpt2-codegolf, qemu-alpine-ssh, write-compressor) ran the full 1800s agent budget (base timeout × 2.0) and were killed. write-compressor streamed a 54 MB transcript / 195k output tokens — pure generation, not compute. These interact with B: a runaway text loop eats the wall clock.

D — "Confident false success" is the dominant scored failure

19 of 26 non-passing trials end with success language ("Done!", "14/14 PASS", "byte-for-byte match ✅", "The best move is…"). Examples:
- db-wal-recovery: gave up on real decryption and fabricated the 6 missing records by "following the obvious pattern (alphabetical fruits, value = id×100)", then declared the DB recovered.
- train-fasttext: "Done! model saved to /app/model.bin", reward 0.
- chess-best-move: confidently reported a Stockfish eval for the wrong move.

Several of these tasks ship a check.py/verifier the agent could have run (regex-chess explicitly says "You can look at the provided check.py"). The model mostly doesn't self-verify before declaring done.
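To count this mode mechanically rather than by reading transcripts, one could flag trials whose final message claims success although no verifier command was ever executed. The sketch below assumes the final assistant text and the list of shell commands have already been extracted from the transcript; that extraction, the function names, the success regex and the verifier name list are all hypothetical, and the regex is deliberately crude (it would miss phrasing such as "The best move is…").

```python
import re

# Assumed success vocabulary; extend after inspecting real transcripts.
SUCCESS_RE = re.compile(
    r"\b(?:done|pass(?:ed)?|success(?:fully)?|recovered)\b|✅",
    re.IGNORECASE,
)


def claims_success(final_text: str) -> bool:
    """True if the final message contains success language."""
    return bool(SUCCESS_RE.search(final_text))


def ran_verifier(commands, verifier_names=("check.py", "pytest")) -> bool:
    """True if any executed command mentions an assumed verifier name."""
    return any(name in cmd for cmd in commands for name in verifier_names)


def unverified_success(final_text: str, commands) -> bool:
    """Success was claimed but no verifier command was run."""
    return claims_success(final_text) and not ran_verifier(commands)


print(unverified_success("Done! model saved to /app/model.bin", ["python train.py"]))  # True
print(unverified_success("14/14 PASS", ["python check.py"]))                           # False
```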

E / F — not harness bugs

mteb-retrieve = EnvironmentStartTimeoutError (image build, infra). caffe-cifar-10 = NonZeroAgentExitCodeError (build never produced training_output.txt). The residual (e.g. chess-best-move, feal-style crypto, regex-chess correctness) is genuine model difficulty.

5. Recommendations — HARNESS (`harnesses/minimal_pi.py` / run config)

Ordered by expected ROI. The theme: make context and generation bounded and self-correcting, since a local model loops more than a frontier one.

1. Fix the context-window mismatch (highest ROI, cheap). The declared context_window (131072) is half the server -c (262144), yet contexts still grew to 262k and hit the hard 400. Two independent fixes, do both:
   - Set the declared context_window a safe margin below the server -c (e.g. 230000 against a 262144 server) so pi's proactive compaction fires before the server's hard limit, never at or above it. The current 131072 is being ignored/outrun; align it with reality but keep a floor of headroom (≥ max_tokens + a few k). With max_tokens=32768 that floor works out to roughly 225k; 230000 only satisfies it once max_tokens is lowered (#2).
   - Catch the 400 … exceeds the available context size response in the provider path and trigger compaction + retry instead of erroring the turn. This directly recovers path-tracing-reverse and video-processing (2 dead trials → retry/compact). pi's reactive compaction already exists; the harness just needs the 400 to route into it rather than surfacing as stopReason: error.
2. Bound single-turn generation without hurting legit big writes (mode B). max_tokens=32768 exists so large file writes aren't truncated. Keep that for file-write tool calls, but the 32k ceiling is also what lets a plain-text reasoning loop run to 97k chars. Options, cheapest first:
   - Lower max_tokens to ~12–16k as the default and let the model chunk large writes across turns (most passing tasks used <6k output/turn; only runaways used 32k). Fail-fast on loops, minor cost to genuinely huge single writes.
   - Add an anti-repetition stop (see server §6.1, DRY sampler) so loops die on their own regardless of max_tokens.
   - Detect stopReason: length in the harness and inject a system nudge ("your last message was truncated for looping; stop and take a different concrete action") instead of silently continuing.
3. Add a "verify before you declare done" contract (mode D, biggest scored win). This is the highest-value prompt/scaffold change. In the harness' system preamble (pi supports an extension/prompt hook), instruct the agent to, before its final answer: (a) run any provided check.py/test, (b) re-read the file it claims to have written, (c) only claim success on observed passing output, never on "the pattern suggests…". db-wal-recovery, train-fasttext, reshard-c4-data, headless-terminal all had a way to self-check and skipped it. Even a modest lift here converts several near-misses.
4. Make compaction preserve a task-anchored summary. When pi compacts, ensure the retained summary re-states the original task text + the acceptance criteria + files already written. The post-compaction turns are where the model most often loses the thread and declares premature success.
5. Instrument for data-driven tuning (you asked for robust + data-driven). Emit per-trial, per-turn: stopReason counts, max <think> tokens, max text tokens, #compactions (threshold vs overflow), peak input tokens, whether any provided verifier was executed. scripts/summarize_run.py already exists; extend it to tabulate these so budget/window sweeps are one command. (This analysis was reconstructed by hand from pi.txt; bake it in.)
6. Per-task-class timeouts. Build/compile/train tasks (caffe, compcert, qemu, fasttext) legitimately need the wall clock; the 2.0 multiplier is right for them but wastes the whole budget on a text-loop task. Consider a lower agent timeout combined with anti-loop (#2) so runaways are killed in minutes, and reserve the long budget for tasks that are actually compute-bound.
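The "catch the 400, compact, retry" path from the first recommendation could look roughly like the sketch below. It is written in Python for consistency with the harness and is not pi's real provider code: `ProviderError`, `send` and `compact` are hypothetical stand-ins for whatever the provider path actually exposes, and only the error substring is taken from the observed server response.

```python
from typing import Callable, List

# Substring observed in the server's 400 response in this run.
OVERFLOW_MARKER = "exceeds the available context size"


class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status and response body."""

    def __init__(self, status: int, body: str):
        super().__init__(f"{status}: {body}")
        self.status = status
        self.body = body


def is_context_overflow(err: ProviderError) -> bool:
    return err.status == 400 and OVERFLOW_MARKER in err.body


def send_with_overflow_retry(
    messages: List[str],
    send: Callable[[List[str]], str],
    compact: Callable[[List[str]], List[str]],
    max_retries: int = 2,
) -> str:
    """Send a request; on a context-overflow 400, compact and retry."""
    for attempt in range(max_retries + 1):
        try:
            return send(messages)
        except ProviderError as err:
            # Anything other than an overflow, or a final failed attempt,
            # is surfaced unchanged to the caller.
            if not is_context_overflow(err) or attempt == max_retries:
                raise
            messages = compact(messages)
    raise RuntimeError("unreachable")
```

Bounding the retries matters: if compaction cannot shrink the context enough, the turn should still fail visibly rather than loop.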

6. Recommendations — SERVER (`llama_cpp_on_my_desktop/tools/llama-swap/config.yaml`)

The config is already thoughtful (full GPU offload, MTP speculative decode, per-model contexts, native reasoning budget). Concrete levers you may have missed:

6.1 Add anti-repetition / DRY sampling (directly targets mode B)

The qwen35moe_base sampler is --temp 0.6 --top-p 0.95 --top-k 20 with no repetition or DRY penalty. The plain-text loops in §4-B ("Wait, I've tried this… let me try a completely different approach…" ×N) are exactly what llama.cpp's DRY sampler suppresses. Try adding to the qwen bases:

--dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2

(and/or a mild --repeat-penalty 1.05 --repeat-last-n 256). This is a data-driven experiment: A/B it on polyglot-rust-c, build-pov-ray, path-tracing — the three worst loopers — and watch the length-stop count drop. Keep temp/top-p/top-k at Qwen's recommended values; DRY is orthogonal.
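To get a feel for what those three values do, the sketch below evaluates the DRY penalty as it is usually described: a token that would extend an already-seen sequence of length n has multiplier × base^(n − allowed_length) subtracted from its logit once n reaches allowed_length. This is an illustration of that published formula under the proposed settings, not a reimplementation of llama.cpp's sampler (which also handles sequence breakers and a lookback window).

```python
def dry_penalty(match_len: int, multiplier: float = 0.8,
                base: float = 1.75, allowed_length: int = 2) -> float:
    """Logit penalty for extending a repeated sequence of `match_len` tokens."""
    if match_len < allowed_length:
        return 0.0
    return multiplier * base ** (match_len - allowed_length)


for n in (1, 2, 3, 5, 10):
    print(f"repeat length {n:2d}: penalty {dry_penalty(n):.2f}")
```

The penalty grows exponentially with the length of the repeated run, so short legitimate repeats (identifiers, paths) are barely touched while a long verbatim loop is pushed down hard.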

6.2 Reconsider KV cache quantization for long context

KV_TYPE=q4_0 (from compose env) is applied to both K and V at a 262k context. q4_0 KV is aggressive and is known to degrade long-context recall/reasoning — which is precisely where these agent trajectories live (100k–260k contexts). The MoE weights are only ~15.8 GB on a 24 GB card, so you likely have VRAM headroom. Test KV_TYPE=q8_0 (or K=q8_0 / V=q4_0) on a couple of long-context tasks (path-tracing, schemelike-metacircular-eval) and compare pass rate. If it fits, q8_0 KV is a cheap quality win for exactly the tasks that are overflowing.
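Whether q8_0 fits can be estimated before testing. The sketch below computes KV-cache size from the cache types, using the block layouts of q4_0 and q8_0 (32 values plus one fp16 scale, i.e. 18 and 34 bytes per block). The layer count, KV-head count and head dimension are HYPOTHETICAL placeholders, not this model's real architecture; read the real values from the GGUF metadata or the llama-server startup log before trusting the output.

```python
# Bytes per cached element for each cache type.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, k_type: str, v_type: str) -> int:
    """Total K plus V cache size in bytes for `n_ctx` cached positions."""
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    return int(elems * (BYTES_PER_ELEM[k_type] + BYTES_PER_ELEM[v_type]))


# HYPOTHETICAL architecture numbers, for illustration only. Count only the
# layers that actually keep a KV cache.
ARCH = dict(n_layers=10, n_kv_heads=2, head_dim=256)
N_CTX = 262_144  # server -c in this run

for k_type, v_type in [("q4_0", "q4_0"), ("q8_0", "q4_0"), ("q8_0", "q8_0")]:
    size = kv_cache_bytes(n_ctx=N_CTX, k_type=k_type, v_type=v_type, **ARCH)
    print(f"K={k_type} V={v_type}: {size / 2**30:.2f} GiB")
```

Independent of the architecture, moving both K and V from q4_0 to q8_0 multiplies the cache size by 34/18, a bit under 2×.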

6.3 The reasoning budget is validated — but sweep it on the full suite

--reasoning-budget 4000 is confirmed working (max think ≈3.9–4.5k, zero reasoning runaways). RUNS.md itself flags that 4000 was never swept for breadth. Now that the full suite is running, do the 2000 / 4000 / 8000 sweep it calls for — but note the new bottleneck is text generation (§4-B), not thinking, so pair the budget sweep with the DRY experiment (§6.1) rather than expecting the budget alone to help further.

6.4 Keep the context window at the server max, fix the client instead

Don't lower the server -c — the MoE fits 262k and you want the headroom. The overflow problem is a client/harness issue (§5.1): pi must compact below 262k. Server-side, the only useful change is to make sure the OOM/exceed response is a clean 400 the harness can catch (it already is).

6.5 Speed (helps mode C timeouts)

MTP speculative decode is already on (--spec-type draft-mtp --spec-draft-n-max 2). If timeouts stay common on the full suite, it is worth testing --spec-draft-n-max 3–4 (more draft tokens/step; MoE + MTP usually accepts them) and confirming that flash-attention (-fa on) is active. Both raise tok/s on the exact long-generation tasks that time out. Measure with the per-request eval time … tok/s already in the llama-server logs.

7. Suggested experiment order (fastest signal first)

1. Harness context fix (§5.1): align context_window to ~230k and catch the 262k 400 → compact/retry. Recovers 2 dead trials immediately and reduces lossy compaction across 11.
2. DRY sampler (§6.1): one-line server change, A/B on the 3 worst loopers.
3. Lower max_tokens to ~12–16k (§5.2): fail-fast on remaining loops.
4. "Verify before done" preamble (§5.3): largest scored upside, prompt-only.
5. q8_0 KV test (§6.2) and reasoning-budget sweep (§6.3): run against the full suite once it completes, using the extended summarize_run.py metrics.

Re-run the smoke set (regex-log, openssl-selfsigned-cert, plus one looper like polyglot-rust-c and one overflow task like path-tracing) after each change — that four-task set exercises every mechanical mode above and gives a fast read.