skip to content
Help

Stuck? Start here. Loom is built so that nothing you can do is unrecoverable — a crash costs only the in-flight sprint, a stop never dirties your main checkout, and a transient outage is waited out, not failed. This page is the operator's lookup for when a run misbehaves.

Jump to the symptom:

inspecting a run

Three verbs tell you everything about a run. status is the cold re-orient command — reach for it first when you've lost context.

loom list every run — name, state, sprint progress, title
loom status <name> title, state, paths, branch, ledger — the cold re-orient
loom logs <name> -f tail the activity journal (-f follows until inactive)

cold re-orientation — after losing context, read loom list then loom status <name>, then go to the durable artifacts on disk:

tail -n 80 .loom/<plan>/history.jsonl
tail -n 80 .loom/<plan>/logs/<stage>.json
tail -n 80 .loom/<plan>/milestones/<milestone>/logs/<stage>.json

Run-level agent logs live under .loom/<plan>/logs/; milestone sprint logs live under that milestone's own logs/. The last timestamped line tells you which stage is still active.

resuming & recovering

Killed process or lost terminal? Resume is just re-running Loom. Done sprints are re-derived from the committed ledger and skipped — you only ever lose the in-flight sprint.

loom resume <name> --headless resume by name
loom feature-plan.md --headless or just re-run the same plan

A run that's genuinely botched — and should not resume — gets wiped instead:

loom feature-plan.md --clean wipe worktree, branch, .loom — then rebuild fresh
loom remove <name> wipe and abandon the plan (alias: --remove)

--clean and --remove dry-run by default — they print what they would delete and ask first; add --yes (-y) to skip the prompt.

stopwhen a run stops
STOP
Loom halted on terminal trouble.

It leaves a by-hand message naming the worktree and branch, plus an investigate report under .loom/<plan>/reports/. Read the report, fix by hand in the named worktree, then re-run Loom to resume.

your main checkout is never left dirty
Whatever the failure, Loom contains it: a stop restores main to exactly its pre-merge state and points you at the disposable worktree — your canonical checkout is untouched.
transient failures

A 529/overloaded, a rate-limit, a 5xx, or a dropped connection is not a stop. Loom retries the stage on the LOOM_RETRY_DELAYS schedule — by default immediate, then +10 min, then +1 h, then give up (four attempts).

retrying a card or line means Loom is waiting, not hung — it's backing off, and an operator Ctrl-C stays responsive during the sleep.

Detection is allow-list only: a genuine error — auth, a CLI bug — stays terminal and is never retried. Set LOOM_RETRY_DELAYS= (empty) to disable retry and fail fast.

spend guards & timeouts

Two distinct ceilings guard against a runaway agent — and the routine one is a no-progress guard, not an elapsed-time limit, so a long chunk on a busy box is never killed just for taking a while.

The two spend guards — env var, default, and what it bounds.
env vardefaultwhat it bounds
LOOM_NO_PROGRESS_SECS1800 (30 min)the execute stall window — reaps an agent only after it emits no new output and no token movement for this long; 0 disables it
LOOM_TIMEOUT14400 (4 h)the absolute runaway backstop — a true wall-clock ceiling, not a routine "split it" trigger

Both are per-unit overridable via a sprint file's front-matter, so an unusually long or quiet unit can raise its own ceilings without changing the job-wide defaults:

.loom/<plan>/NN-slug.md — front-matter
---
no_progress_secs: 3600
timeout_secs: 28800
---

When the stall guard does fire on execute, Loom doesn't discard the work: it commits the worktree and runs the real gate. A green gate ⇒ the chunk is DONE and the run advances.

the agent sandbox

Each agent runs inside a containment wrapper — best-effort bubblewrap by default: a private PID namespace and a read-only main checkout, transparent otherwise. Network and the rest of the filesystem stay read-write, so the agent can build in its worktree, run git and tests, and reach the API.

contained
  • signals & process-group kills (a stray kill(-1) is bounded to the sandbox)
  • fork storms and orphaned subprocesses
  • writes into the main checkout — bound read-only; an absolute-path escape fails with EROFS
not contained (default)
  • writes to $HOME, /tmp, and the worktree
  • build artifacts
  • the network

modes — set LOOM_SANDBOX (also read from .env.loom, so per-project opt-in is one line):

LOOM_SANDBOX modes.
valuebehaviour
unset / auto (default)best-effort — sandbox where the host supports it; otherwise warn once and run unsandboxed
1 / on / requirerequired — refuse to start if the sandbox doesn't work
0 / offnever sandbox
LOOM_SANDBOX_CMD="<argv>"custom wrapper prefix — e.g. a dedicated uid or a container shim

host setup (bubblewrap) — install it, then confirm unprivileged user namespaces work:

install bubblewrap
sudo apt install bubblewrap     # Debian/Ubuntu
sudo dnf install bubblewrap     # Fedora
sudo pacman -S bubblewrap       # Arch
confirm unprivileged user namespaces work
bwrap --dev-bind / / --proc /proc --unshare-pid true && echo ok

If that fails with setting up uid map: Permission denied, your kernel restricts unprivileged userns. On Ubuntu ≥24.04 (AppArmor), the preferred fix grants only bwrap the right — minimal, no system-wide relaxation, no setuid:

grant bwrap the userns right via an AppArmor profile (preferred)
sudo tee /etc/apparmor.d/bwrap >/dev/null <<'EOF'
abi <abi/4.0>,
include <tunables/global>
profile bwrap /usr/bin/bwrap flags=(unconfined) {
  userns,
  include if exists <local/bwrap>
}
EOF
sudo apparmor_parser -r /etc/apparmor.d/bwrap

Or relax it more broadly (reverts a deliberate hardening), or — on distros gated by unprivileged_userns_clone instead — flip that sysctl (persist in /etc/sysctl.d/):

sysctl alternatives (broader)
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
sudo sysctl -w kernel.unprivileged_userns_clone=1

Re-run the check; Loom then reports sandbox — agents run under bwrap … at startup.

backstop on unsandboxed hosts
Where the read-only-main bind isn't in force, Loom diffs the main checkout's git status around each code-mutating stage and stops loudly if an agent's write escaped. If you edit the checkout during a run, opt out with LOOM_CONFINEMENT_CHECK=0.
authauthentication problems
you authenticate the CLI, not Loom
Loom's agents are coding-agent subprocesses — they use whatever auth the selected harness CLI already has. Loom fails fast at startup with that harness's own login hint when auth is missing, so a misconfigured run stops before it spawns a single agent.

So when a run won't start on an auth error, fix it where the credential lives: log the harness CLI in on the host. The default sandbox exposes the usual credential paths, so a CLI logged in on the host is logged in inside the agent sandbox too. Per-harness setup lives on the Harnesses page.

why the sandbox exists

Loom drives each autonomous agent with --dangerously-skip-permissions in your own user account — so without containment, a runaway anywhere in an agent's process tree would have your whole account as its blast radius. That isn't hypothetical: a process-group reap once mis-fired and tore down an entire desktop session. The orchestrator bug behind that is fixed, and the sandbox is the defense-in-depth that keeps a future misfire bounded. Run with LOOM_SANDBOX=1 wherever you require containment.

faqcommon questions
Will resuming re-run sprints that already finished?
No. Done sprints are re-derived from the committed ledger and skipped, so a resume only ever rebuilds the in-flight sprint. A re-run of an already-merged plan reports "already built — nothing to do".
A run is stuck on a line — is it hung?
It's waiting, not hung. A transient upstream blip (529, rate-limit, 5xx, dropped connection) is being retried on the LOOM_RETRY_DELAYS backoff. An operator Ctrl-C stays responsive during the sleep.
Loom stopped — did it leave my main checkout dirty?
No. A stop restores main to exactly its pre-merge state and leaves a by-hand message naming the worktree and branch, plus an investigate report under .loom/<plan>/reports/.
My long chunk got reaped — was it the timeout?
Unlikely. The execute guard is no-progress (LOOM_NO_PROGRESS_SECS), not elapsed-time — it reaps only when an agent emits nothing for the stall window. Raise no_progress_secs: in that sprint's front-matter if a unit is legitimately quiet.
where to go next