VRAMPilot
Runs GGUF models locally and recovers from out-of-memory at runtime instead of crashing.
VRAMPilot is a free local tool that runs GGUF models with llama.cpp and recovers from out-of-memory at runtime instead of crashing. You point it at a .gguf file; it reads your GPU, picks a configuration that fits, launches a llama-server, and if the model still runs out of memory — at boot or in the middle of a generation — it backs off, retries until it serves, and reports what it traded away to make the model fit.
The problem: OOM crashes
The first wall most people hit running a local model is out-of-memory. A tool estimates that a model fits, launches it, and the estimate turns out to be wrong — fragmentation, another application holding VRAM, a longer context than the math assumed. The popular tools prevent OOM at load time; when the estimate is wrong anyway, the server crashes, or silently spills into system RAM and slows to a crawl without telling you why.
An estimator is a guess made before launch. The recovery loop is what runs after the guess is already wrong.
What it does
Five things, each validated on real hardware. Every figure on this site links to the validation file it comes from, and those files are published verbatim under /proofs/. Click any figure to read its raw source.
- Auto-fit plan. It reads the model from the actual GGUF header bytes (layers, MoE experts, quant, size) and your GPU's free VRAM (measured on NVIDIA, estimated on AMD/Intel/Apple — the report always says which), then picks GPU layers, context, KV-cache precision and MoE expert-offload so the model fits. Validated up to a 9.5 GB file on an 8 GB GPU.
- Runtime OOM-recovery. If the launch still hits an out-of-memory error, it detects it, backs off across axes — KV-cache precision first (to keep your context), then context, then more CPU offload — and retries until the server boots and actually serves a token. Then it shows you the back-off trail.
- Persistence of what booted. Every configuration that actually booted is stored in a local, append-only SQLite database. The next launch starts at the known-good configuration; a driver or GPU change invalidates it and triggers a replan. Nothing leaves your machine, and a `configs list` command shows everything it remembers.
- In-inference watchdog. It covers the crash the load-time tools do not: VRAM exhaustion in the middle of a long generation. In a validated run under real external VRAM pressure, the floor was crossed at 102 MiB free and the server recovered in 223.9 s at a degraded configuration, while the pressure stayed. The honest cost, stated by the tool itself: the generation in flight is lost. Recovery is real, not invisible.
- Zero-prerequisite install. No llama.cpp setup needed: the first run fetches a pinned llama.cpp build for your OS and GPU with mandatory SHA256 verification (a mismatch deletes the file and aborts). Measured on the gate machine: 7.6 s from a cold start to a real served completion — with a 1 GB test model already on disk (the timing includes the pinned binary fetch, not the model download).
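The back-off order described in the runtime OOM-recovery item (KV-cache precision first, then context, then more CPU offload) can be sketched roughly as follows. This is a minimal illustration, not VRAMPilot's actual code: the `Config` fields, the precision ladder, the context floor, the step sizes and the `try_launch` callback are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple

# Illustrative assumptions only: these values are not VRAMPilot's real settings.
KV_LADDER = ["f16", "q8_0", "q4_0"]  # KV-cache precision, highest first
MIN_CTX = 2048                       # assumed context floor
LAYER_STEP = 4                       # assumed number of layers moved to CPU per step


@dataclass(frozen=True)
class Config:
    gpu_layers: int
    ctx: int
    kv: str


def back_off(cfg: Config) -> Optional[Tuple[Config, str]]:
    """Return the next smaller configuration and a note, or None at the floor."""
    i = KV_LADDER.index(cfg.kv)
    if i + 1 < len(KV_LADDER):  # 1. lower KV-cache precision first, keeping context
        nxt_kv = KV_LADDER[i + 1]
        return replace(cfg, kv=nxt_kv), f"kv {cfg.kv} -> {nxt_kv}"
    if cfg.ctx > MIN_CTX:       # 2. then shrink the context
        nxt_ctx = max(MIN_CTX, cfg.ctx // 2)
        return replace(cfg, ctx=nxt_ctx), f"ctx {cfg.ctx} -> {nxt_ctx}"
    if cfg.gpu_layers > 0:      # 3. then offload more layers to the CPU
        nxt_layers = max(0, cfg.gpu_layers - LAYER_STEP)
        return replace(cfg, gpu_layers=nxt_layers), f"gpu_layers {cfg.gpu_layers} -> {nxt_layers}"
    return None


def launch_with_recovery(
    cfg: Config, try_launch: Callable[[Config], bool]
) -> Tuple[Optional[Config], List[str]]:
    """Retry until try_launch succeeds or the floor is reached; return the trail."""
    trail: List[str] = []
    while True:
        if try_launch(cfg):
            return cfg, trail
        step = back_off(cfg)
        if step is None:
            trail.append("hit the floor")
            return None, trail
        cfg, note = step
        trail.append(note)


# Stand-in launcher: pretends the model only fits with a q4_0 KV cache and ctx <= 4096.
def fake_launch(c: Config) -> bool:
    return c.kv == "q4_0" and c.ctx <= 4096


final, trail = launch_with_recovery(Config(gpu_layers=32, ctx=8192, kv="f16"), fake_launch)
print(final, trail)
```

In a real tool, `try_launch` would start the server and confirm that it actually serves a token. The returned trail corresponds to the back-off report shown to the user, and a `None` result is the "hit the floor" case.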
Honest scope
- VRAMPilot is a UX/automation layer on top of llama.cpp, which does the actual inference (loading, offload, serving). It is not a new runtime and not a research breakthrough. If llama.cpp cannot run your model, neither can VRAMPilot.
- Load-time auto-fit now exists upstream: recent llama.cpp builds ship a `-fit` option, and LM Studio, Ollama and Jan all estimate before launch. Load-time prevention is table stakes. What remains unserved, and what VRAMPilot adds, is runtime recovery, persistence of what actually booted, and honest lossiness reporting.
- The moat is an engineering moat, not a defensible one. Any incumbent could add a recovery loop in a point release. The bet is being correct and thorough on something genuinely unserved today.
- No software beats physics. A model that does not fit even at the floor configuration cannot run; VRAMPilot tells you it hit the floor instead of pretending.
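To make "persistence of what actually booted" concrete, here is a hedged sketch of an append-only store keyed by a hardware fingerprint. The table schema, the column names, the fingerprint strings and the function names are all hypothetical; the text above only states that the database is local, append-only SQLite and that a driver or GPU change invalidates the stored configuration.

```python
import json
import sqlite3
import time
from typing import Optional

# Hypothetical schema: column names and the fingerprint format are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS booted (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    model       TEXT NOT NULL,
    fingerprint TEXT NOT NULL,  -- e.g. GPU name plus driver version
    config      TEXT NOT NULL,  -- JSON of the configuration that booted
    booted_at   REAL NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local store; ':memory:' keeps the demo off disk."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def record_boot(db: sqlite3.Connection, model: str, fingerprint: str, config: dict) -> None:
    """Append one row; this sketch never issues UPDATE or DELETE."""
    db.execute(
        "INSERT INTO booted (model, fingerprint, config, booted_at) VALUES (?, ?, ?, ?)",
        (model, fingerprint, json.dumps(config, sort_keys=True), time.time()),
    )
    db.commit()


def known_good(db: sqlite3.Connection, model: str, fingerprint: str) -> Optional[dict]:
    """Return the latest config that booted on this exact hardware, else None."""
    row = db.execute(
        "SELECT config FROM booted WHERE model = ? AND fingerprint = ? "
        "ORDER BY id DESC LIMIT 1",
        (model, fingerprint),
    ).fetchone()
    return json.loads(row[0]) if row else None


db = open_store()
record_boot(db, "model.gguf", "gpu-a|driver-1", {"gpu_layers": 32, "ctx": 4096, "kv": "q8_0"})
record_boot(db, "model.gguf", "gpu-a|driver-1", {"gpu_layers": 28, "ctx": 4096, "kv": "q8_0"})

same_hw = known_good(db, "model.gguf", "gpu-a|driver-1")  # latest row for this hardware
new_drv = known_good(db, "model.gguf", "gpu-a|driver-2")  # no match, so a replan is needed
print(same_hw, new_drv)
```

Keying the lookup on the fingerprint means a driver or GPU change simply finds no matching row, which is one simple way to trigger a replan without deleting any history.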
No cookies
This site sets no cookies and loads nothing from third parties. Like the tool itself: local, nothing transmitted.