Stuck? Start here. Loom is built so that nothing you can do is unrecoverable — a crash costs only the in-flight sprint, a stop never dirties your main checkout, and a transient outage is waited out, not failed. This page is the operator's lookup for when a run misbehaves.
Jump to the symptom:
Three verbs tell you everything about a run. status is the cold re-orient command — reach for it first when you've lost context.
cold re-orientation — after losing context, read loom list then loom status <name>, then go to the durable artifacts on disk:
Run-level agent logs live under .loom/<plan>/logs/; milestone sprint logs live under that milestone's own logs/. The last timestamped ▶ line tells you which stage is still active.
Killed process or lost terminal? Resume is just re-running Loom. Done sprints are re-derived from the committed ledger and skipped — you only ever lose the in-flight sprint.
A run that's genuinely botched — and should not resume — gets wiped instead:
--clean and --remove dry-run by default — they print what they would delete and ask first; add --yes (-y) to skip the prompt.
It leaves a by-hand message naming the worktree and branch, plus an investigate report under .loom/<plan>/reports/. Read the report, fix by hand in the named worktree, then re-run Loom to resume.
A 529/overloaded, a rate-limit, a 5xx, or a dropped connection is not a stop. Loom retries the stage on the LOOM_RETRY_DELAYS schedule — by default immediate, then +10 min, then +1 h, then give up (four attempts).
⏳ card or line means Loom is waiting, not hung — it's backing off, and an operator Ctrl-C stays responsive during the sleep.
Detection is allow-list only: a genuine error — auth, a CLI bug — stays terminal and is never retried. Set LOOM_RETRY_DELAYS= (empty) to disable retry and fail fast.
Two distinct ceilings guard against a runaway agent — and the routine one is a no-progress guard, not an elapsed-time limit, so a long chunk on a busy box is never killed just for taking a while.
| env var | default | what it bounds |
|---|---|---|
| LOOM_NO_PROGRESS_SECS | 1800 (30 min) | the execute stall window — reaps an agent only after it emits no new output and no token movement for this long; 0 disables it |
| LOOM_TIMEOUT | 14400 (4 h) | the absolute runaway backstop — a true wall-clock ceiling, not a routine "split it" trigger |
Both are per-unit overridable via a sprint file's front-matter, so an unusually long or quiet unit can raise its own ceilings without changing the job-wide defaults:
--- no_progress_secs: 3600 timeout_secs: 28800 ---
When the stall guard does fire on execute, Loom doesn't discard the work: it commits the worktree and runs the real gate. A green gate ⇒ the chunk is DONE and the run advances.
Each agent runs inside a containment wrapper — best-effort bubblewrap by default: a private PID namespace and a read-only main checkout, transparent otherwise. Network and the rest of the filesystem stay read-write, so the agent can build in its worktree, run git and tests, and reach the API.
- signals & process-group kills (a stray
kill(-1)is bounded to the sandbox) - fork storms and orphaned subprocesses
- writes into the main checkout — bound read-only; an absolute-path escape fails with
EROFS
- writes to
$HOME,/tmp, and the worktree - build artifacts
- the network
modes — set LOOM_SANDBOX (also read from .env.loom, so per-project opt-in is one line):
| value | behaviour |
|---|---|
| unset / auto (default) | best-effort — sandbox where the host supports it; otherwise warn once and run unsandboxed |
| 1 / on / require | required — refuse to start if the sandbox doesn't work |
| 0 / off | never sandbox |
| LOOM_SANDBOX_CMD="<argv>" | custom wrapper prefix — e.g. a dedicated uid or a container shim |
host setup (bubblewrap) — install it, then confirm unprivileged user namespaces work:
sudo apt install bubblewrap # Debian/Ubuntu sudo dnf install bubblewrap # Fedora sudo pacman -S bubblewrap # Arch
bwrap --dev-bind / / --proc /proc --unshare-pid true && echo ok
If that fails with setting up uid map: Permission denied, your kernel restricts unprivileged userns. On Ubuntu ≥24.04 (AppArmor), the preferred fix grants only bwrap the right — minimal, no system-wide relaxation, no setuid:
sudo tee /etc/apparmor.d/bwrap >/dev/null <<'EOF'
abi <abi/4.0>,
include <tunables/global>
profile bwrap /usr/bin/bwrap flags=(unconfined) {
userns,
include if exists <local/bwrap>
}
EOF
sudo apparmor_parser -r /etc/apparmor.d/bwrap
Or relax it more broadly (reverts a deliberate hardening), or — on distros gated by unprivileged_userns_clone instead — flip that sysctl (persist in /etc/sysctl.d/):
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0 sudo sysctl -w kernel.unprivileged_userns_clone=1
Re-run the check; Loom then reports sandbox — agents run under bwrap … at startup.
git status around each code-mutating stage and stops loudly if an agent's write escaped. If you edit the checkout during a run, opt out with LOOM_CONFINEMENT_CHECK=0.
So when a run won't start on an auth error, fix it where the credential lives: log the harness CLI in on the host. The default sandbox exposes the usual credential paths, so a CLI logged in on the host is logged in inside the agent sandbox too. Per-harness setup lives on the Harnesses page.
Loom drives each autonomous agent with --dangerously-skip-permissions in your own user account — so without containment, a runaway anywhere in an agent's process tree would have your whole account as its blast radius. That isn't hypothetical: a process-group reap once mis-fired and tore down an entire desktop session. The orchestrator bug behind that is fixed, and the sandbox is the defense-in-depth that keeps a future misfire bounded. Run with LOOM_SANDBOX=1 wherever you require containment.
Will resuming re-run sprints that already finished?
A run is stuck on a ⏳ line — is it hung?
LOOM_RETRY_DELAYS backoff. An operator Ctrl-C stays responsive during the sleep.Loom stopped — did it leave my main checkout dirty?
.loom/<plan>/reports/.My long chunk got reaped — was it the timeout?
LOOM_NO_PROGRESS_SECS), not elapsed-time — it reaps only when an agent emits nothing for the stall window. Raise no_progress_secs: in that sprint's front-matter if a unit is legitimately quiet.Back to the rest of the docs: