Author: automated analysis (Claude), 2026-07-03 Goal of this doc: find the failure modes in the in-progress suite run and turn them into concrete changes to (a) the pi harness and (b) the llama.cpp / llama-swap server config, so a local qwen agent performs closer to its ceiling with no external LLM providers.
runs/suite__qwen3.6-35b-a3b__20260703-003556/ — the live,
still-running full suite. result.json id 1f919902-…, started
2026-07-03T00:35, last updated 10:38Z, not finished.llama-local/qwen3.6-35b-a3b (Qwen3.6-35B-A3B MoE, ~3B active,
UD-IQ4_XS), served by llama-swap on port 8020.harnesses/minimal_pi:MinimalPi, pi 0.73.1. Per-trial kwargs:
max_tokens=32768, context_window=131072, llama_port=8020, thinking=on.
This is the server-flag --reasoning-budget 4000 era (see runs/RUNS.md) —
reasoning is capped server-side.terminal-bench@2.0, git commit
69671fbaac6d67a7ef0dfec016cc38a64ef7a77c, 89 tasks total, K=1 attempt each,
n_concurrent=1, agent_timeout_multiplier=2.0.runs/suite__…235616/ run is a superseded stub
(only 2 task dirs, abandoned) — ignore it.Do not re-analyse the 46 task directories already present in
…003556/ when the run finishes — they are covered here. The tasks left to
analyse are the 43 in §3.2.
| Outcome | Count | Rate |
|---|---|---|
| Pass (reward 1.0) | 19 | 42% |
| Fail (reward 0.0) | 25 | 56% |
| Errored (no reward) | 1 | 2% |
The suite result.json reports mean reward 0.42 over the 44 non-infra trials.
Aggregate cost of the partial run: 136.2M input tokens (98% cache-read),
2.6M output tokens.
Passing tasks are cheap and short (median ~10–30 turns, <35k input tokens). Failures split into a small number of mechanical modes (fixable in the harness/server) and a residual of genuine model-capability misses.
adaptive-rejection-sampler, break-filter-js-from-html, build-pov-ray,
caffe-cifar-10, chess-best-move, cobol-modernization, compile-compcert,
configure-git-webserver, crack-7z-hash, custom-memory-heap-crash, db-wal-recovery,
distribution-search, feal-linear-cryptanalysis, fix-git, git-multibranch,
gpt2-codegolf, headless-terminal, hf-model-inference, kv-store-grpc,
largest-eigenval, llm-inference-batching-scheduler, log-summary-date-ranges,
merge-diff-arc-agi-task, modernize-scientific-stack, mteb-retrieve,
multi-source-data-merger, overfull-hbox, password-recovery, path-tracing-reverse,
path-tracing, polyglot-rust-c, portfolio-optimization, prove-plus-comm,
pypi-server, pytorch-model-cli, qemu-alpine-ssh, qemu-startup, regex-chess,
reshard-c4-data, schemelike-metacircular-eval, torch-tensor-parallelism,
train-fasttext, video-processing, winning-avg-corewars, write-compressor
bn-fit-modify, build-cython-ext, build-pmars, cancel-async-tasks, circuit-fibsqrt,
code-from-image, constraints-scheduling, count-dataset-tokens, dna-assembly,
dna-insert, extract-elf, extract-moves-from-video, feal-differential-cryptanalysis,
filter-js-from-html, financial-document-processor, fix-code-vulnerability,
fix-ocaml-gc, gcode-to-text, git-leak-recovery, large-scale-text-editing, mailman,
make-doom-for-mips, make-mips-interpreter, mcmc-sampling-stan,
model-extraction-relu-logits, mteb-leaderboard, nginx-request-logging,
openssl-selfsigned-cert, polyglot-c-py, protein-assembly, pytorch-model-recovery,
query-optimize, raman-fitting, regex-log, rstan-to-pystan, sam-cell-seg,
sanitize-git-repo, sparql-university, sqlite-db-truncate, sqlite-with-gcov,
torch-pipeline-parallelism, tune-mjcf, vulnerable-secret
Note: the two smoke-suite staples regex-log and openssl-selfsigned-cert (the
tasks RUNS.md tuned the reasoning budget on) are still pending in the full
suite — good, they'll give a clean full-suite read on the 4000 budget.
Categorised the 26 non-passing trials. Categories overlap (a trial can hit several).
| Mode | # trials | Fixable where |
|---|---|---|
| A. Context overflow forcing compaction | 11 | harness + server |
| A′. Hard 262k context 400-error mid-run (trial lost) | 2 | harness (highest ROI) |
B. Single-turn runaway generation (32k length stop) |
8 | harness + server sampler |
| C. Agent timeout (ran out the 1800s wall clock) | 4 | server speed + harness |
| D. "Confident false success" (declared done, verifier failed) | ~19 | harness prompt/verify |
| E. Infra (env build timeout, agent non-zero exit) | 2 | infra, not model |
| F. Genuine capability miss | residual | model, not harness |
The harness declares contextWindow: 131072, but the server's real window is
-c 262144 (the model's full train context). Observed request sizes climbed to
259k–262k tokens, i.e. pi kept feeding context far past its own declared
131k window, and:
path-tracing-reverse and video-processing died on a hard
400 … (262608 tokens) exceeds the available context size (262144) — the turn
errored (stopReason: error) and the whole trial was wasted. These two are
pure harness losses, not model losses.compaction_start reason=overflow (pi only compacted
after nearly overflowing); 6 hit proactive reason=threshold. The proactive
threshold is clearly not anchored to the declared 131k — contexts reached
259k before compaction. Example regex-chess: input grew 2k → 140k with no
proactive compaction; the only compaction_start fired at agent_end.Why it matters: compaction is lossy — the model loses its working notes and often repeats work or declares premature success right after. And the two 400s are dead trials.
The server --reasoning-budget 4000 works: max <think> block across these
trials is ~3.9–4.5k tokens (never near 32k). But 8 trials still hit a 32000-token
length stop — and the runaway is now in ordinary assistant text, which the
reasoning budget doesn't touch. Clearest case polyglot-rust-c: a 97,344-char
single text block looping "OK, let me try a completely different approach… Wait,
I've tried this…" until it hit the 32k ceiling. build-pov-ray burned 296
turns / 242k output tokens doing this. The reasoning cap plugged one leak; the
model found another.
4 tasks (crack-7z-hash, gpt2-codegolf, qemu-alpine-ssh, write-compressor)
ran the full 1800s agent budget (base timeout × 2.0) and were killed.
write-compressor streamed a 54 MB transcript / 195k output tokens — pure
generation, not compute. These interact with B: a runaway text loop eats the wall
clock.
19 of 26 non-passing trials end with success language ("Done!", "14/14 PASS",
"byte-for-byte match ✅", "The best move is…"). Examples:
- db-wal-recovery: gave up on real decryption and fabricated the 6 missing
records by "following the obvious pattern (alphabetical fruits, value = id×100)",
then declared the DB recovered.
- train-fasttext: "Done! model saved to /app/model.bin" — reward 0.
- chess-best-move: confidently reported a Stockfish eval for the wrong move.
Several of these tasks ship a check.py/verifier the agent could have run
(regex-chess explicitly says "You can look at the provided check.py"). The model
mostly doesn't self-verify before declaring done.
mteb-retrieve = EnvironmentStartTimeoutError (image build, infra).
caffe-cifar-10 = NonZeroAgentExitCodeError (build never produced
training_output.txt). The residual (e.g. chess-best-move,
feal-style crypto, regex-chess correctness) is genuine model difficulty.
harnesses/minimal_pi.py / run config)Ordered by expected ROI. The theme: make context and generation bounded and self-correcting, since a local model loops more than a frontier one.
Fix the context-window mismatch (highest ROI, ~cheap).
Declared context_window (131072) is half the server -c (262144), yet
contexts still overflowed to 262k and hard-400'd. Two independent fixes, do
both:
- Set the declared context_window a safe margin below the server -c
(e.g. 230000 against a 262144 server) so pi's proactive compaction fires
before the server's hard limit — never at/above it. The current 131072 is
being ignored/outrun; align it with reality but keep a floor of headroom
(≥ max_tokens + a few k).
- Catch the 400 … exceeds the available context size response in the
provider path and trigger compaction + retry instead of erroring the turn.
This directly recovers path-tracing-reverse and video-processing (2 dead
trials → ret/compact). pi's reactive compaction already exists; the harness
just needs the 400 to route into it rather than surfacing as stopReason:
error.
Bound single-turn generation without hurting legit big writes (mode B).
max_tokens=32768 exists so large file writes aren't truncated — keep that for
file-write tool calls, but the 32k ceiling is also what lets a plain-text
reasoning loop run to 97k chars. Options, cheapest first:
- Lower max_tokens to ~12–16k as the default and let the model chunk large
writes across turns (most passing tasks used <6k output/turn; only runaways
used 32k). Fail-fast on loops, minor cost to genuinely-huge single writes.
- Add an anti-repetition stop (see server §6.3 — DRY sampler) so loops die
on their own regardless of max_tokens.
- Detect stopReason: length in the harness and inject a system nudge
("your last message was truncated for looping; stop and take a different
concrete action") instead of silently continuing.
Add a "verify before you declare done" contract (mode D — biggest scored
win). This is the highest-value prompt/scaffold change. In the harness'
system preamble (pi supports an extension/prompt hook), instruct the agent to,
before its final answer: (a) run any provided check.py/test, (b) re-read the
file it claims to have written, (c) only claim success on observed passing
output — never on "the pattern suggests…". db-wal-recovery,
train-fasttext, reshard-c4-data, headless-terminal all had a way to
self-check and skipped it. Even a modest lift here converts several near-misses.
Make compaction preserve a task-anchored summary. When pi compacts, ensure the retained summary re-states the original task text + the acceptance criteria + files already written. The post-compaction turns are where the model most often loses the thread and declares premature success.
Instrument for data-driven tuning (you asked for robust + data-driven).
Emit per-trial, per-turn: stopReason counts, max <think> tokens, max text
tokens, #compactions (threshold vs overflow), peak input tokens, whether any
provided verifier was executed. scripts/summarize_run.py already exists —
extend it to tabulate these so budget/window sweeps are one command. (This
analysis was reconstructed by hand from pi.txt; bake it in.)
Per-task-class timeouts. Build/compile/train tasks (caffe, compcert, qemu,
fasttext) legitimately need the wall clock; the 2.0 multiplier is right for
them but wastes an hour on a text-loop task. Consider a lower agent timeout
combined with anti-loop (#2) so runaways are killed in minutes, and reserve the
long budget for tasks that are actually compute-bound.
llama_cpp_on_my_desktop/tools/llama-swap/config.yaml)The config is already thoughtful (full GPU offload, MTP speculative decode, per-model contexts, native reasoning budget). Concrete levers you may have missed:
The qwen35moe_base sampler is --temp 0.6 --top-p 0.95 --top-k 20 with no
repetition or DRY penalty. The plain-text loops in §4-B ("Wait, I've tried
this… let me try a completely different approach…" ×N) are exactly what llama.cpp's
DRY sampler suppresses. Try adding to the qwen bases:
--dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2
(and/or a mild --repeat-penalty 1.05 --repeat-last-n 256). This is a data-driven
experiment: A/B it on polyglot-rust-c, build-pov-ray, path-tracing — the
three worst loopers — and watch the length-stop count drop. Keep temp/top-p/top-k
at Qwen's recommended values; DRY is orthogonal.
KV_TYPE=q4_0 (from compose env) is applied to both K and V at a 262k
context. q4_0 KV is aggressive and is known to degrade long-context recall/reasoning
— which is precisely where these agent trajectories live (100k–260k contexts). The
MoE weights are only ~15.8 GB on a 24 GB card, so you likely have VRAM headroom.
Test KV_TYPE=q8_0 (or K=q8_0 / V=q4_0) on a couple of long-context tasks
(path-tracing, schemelike-metacircular-eval) and compare pass rate. If it fits,
q8_0 KV is a cheap quality win for exactly the tasks that are overflowing.
--reasoning-budget 4000 is confirmed working (max think ≈3.9–4.5k, zero reasoning
runaways). RUNS.md itself flags that 4000 was never swept for breadth. Now that
the full suite is running, do the 2000 / 4000 / 8000 sweep it calls for — but
note the new bottleneck is text generation (§4-B), not thinking, so pair the
budget sweep with the DRY experiment (§6.1) rather than expecting the budget alone
to help further.
Don't lower the server -c — the MoE fits 262k and you want the headroom. The
overflow problem is a client/harness issue (§5.1): pi must compact below 262k.
Server-side, the only useful change is to make sure the OOM/exceed response is a
clean 400 the harness can catch (it already is).
MTP speculative decode is already on (--spec-type draft-mtp --spec-draft-n-max 2).
If timeouts stay common on the full suite, worth testing --spec-draft-n-max 3–4
(more draft tokens/step; MoE + MTP usually accepts them) and confirming -fa on
(flash-attention) is active — both raise tok/s on the exact long-generation tasks
that time out. Measure with the per-request eval time … tok/s already in the
llama-server logs.
context_window to ~230k and catch the
262k 400 → compact/retry. Recovers 2 dead trials immediately + reduces lossy
compaction across 11.max_tokens to ~12–16k (§5.2) — fail-fast on remaining loops.summarize_run.py metrics.Re-run the smoke set (regex-log, openssl-selfsigned-cert, plus one looper like
polyglot-rust-c and one overflow task like path-tracing) after each change —
that four-task set exercises every mechanical mode above and gives a fast read.