VRAMPilot

Runs GGUF models locally and recovers from out-of-memory at runtime instead of crashing.

VRAMPilot is a free local tool that runs GGUF models with llama.cpp. You point it at a .gguf file; it reads your GPU, picks a configuration that fits, and launches llama-server. If the model still runs out of memory, whether at boot or in the middle of a generation, it backs off, retries until it serves, and reports what it traded away to make the model fit.

The problem: OOM crashes

The first wall most people hit running a local model is out-of-memory. A tool estimates that a model fits, launches it, and the estimate turns out to be wrong: fragmentation, another application holding VRAM, a longer context than the math assumed. Popular tools try to prevent OOM at load time. When the estimate is wrong anyway, the server crashes, or it silently spills into system RAM and slows to a crawl without telling you why.
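To make "a longer context than the math assumed" concrete, here is a rough sketch of the usual KV-cache size arithmetic for a transformer with an f16 cache. The model dimensions (32 layers, 8 KV heads, head dimension 128) are hypothetical values chosen for illustration, and the function name is an assumption, not part of VRAMPilot.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV-cache size: one key and one value vector
    per layer, per KV head, per token in the context window."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

MIB = 1024 ** 2

# Hypothetical model dimensions, f16 cache (2 bytes per element).
planned = kv_cache_bytes(32, 8, 128, n_ctx=4096)    # what the estimate assumed
actual = kv_cache_bytes(32, 8, 128, n_ctx=32768)    # what the user asked for

print(planned // MIB, "MiB planned")  # 512 MiB planned
print(actual // MIB, "MiB actual")    # 4096 MiB actual
```

Under these assumptions, an eightfold longer context costs eight times the cache, which is enough to turn a comfortable fit into an out-of-memory error on an 8 GB card.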

An estimator is a guess made before launch. The recovery loop is what runs after the guess is already wrong.

What it does

Five things, each validated on real hardware. Every figure on this site links to the validation file it comes from, and those files are published verbatim under /proofs/. Click any figure to read its raw source.

  1. Auto-fit plan. It reads the model from the actual GGUF header bytes (layers, MoE experts, quant, size) and your GPU's free VRAM (measured on NVIDIA, estimated on AMD/Intel/Apple — the report always says which), then picks GPU layers, context, KV-cache precision and MoE expert-offload so the model fits. Validated up to a 9.5 GB file on an 8 GB GPU.
  2. Runtime OOM-recovery. If the launch still hits an out-of-memory error, it detects it, backs off across axes — KV-cache precision first (to keep your context), then context, then more CPU offload — and retries until the server boots and actually serves a token. Then it shows you the back-off trail.
  3. Persistence of what booted. Every configuration that actually booted is stored in a local, append-only SQLite database. The next launch starts at the known-good configuration; a driver or GPU change invalidates it and triggers a replan. Nothing leaves your machine, and a configs list command shows everything it remembers.
  4. In-inference watchdog. It covers the crash the load-time tools do not: VRAM exhaustion in the middle of a long generation. In a validated run under real external VRAM pressure, the free-VRAM floor was crossed at 102 MiB free and the server recovered in 223.9 s at a degraded configuration, with the pressure still applied. The honest cost, stated by the tool itself: the generation in flight is lost. Recovery is real, not invisible.
  5. Zero-prerequisite install. No llama.cpp setup needed: the first run fetches a pinned llama.cpp build for your OS and GPU with mandatory SHA256 verification (a mismatch deletes the file and aborts). Measured on the gate machine: 7.6 s from a cold start to a real served completion — with a 1 GB test model already on disk (the timing includes the pinned binary fetch, not the model download).
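
The back-off order in item 2 can be sketched as a small retry loop. This is a minimal illustration, not VRAMPilot's actual code: the Config fields, the OutOfMemory exception, the step sizes (halving the context, dropping four GPU layers at a time) and the launch callable are all hypothetical. Only the order of the axes comes from the description above: KV-cache precision first, then context, then more CPU offload.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Config:  # hypothetical shape of a launch configuration
    gpu_layers: int
    ctx: int
    kv_type: str  # "f16" or "q8_0" in this sketch


class OutOfMemory(Exception):
    """Hypothetical error raised by the launcher on an OOM failure."""


def back_off(cfg: Config, min_ctx: int = 2048) -> Optional[Tuple[Config, str]]:
    """Return the next smaller config plus a note on what was traded away."""
    if cfg.kv_type == "f16":  # 1. KV-cache precision, to keep the context
        return replace(cfg, kv_type="q8_0"), "KV cache f16 -> q8_0"
    if cfg.ctx // 2 >= min_ctx:  # 2. context length
        return replace(cfg, ctx=cfg.ctx // 2), f"context {cfg.ctx} -> {cfg.ctx // 2}"
    if cfg.gpu_layers > 0:  # 3. more CPU offload
        fewer = max(0, cfg.gpu_layers - 4)
        return replace(cfg, gpu_layers=fewer), f"GPU layers {cfg.gpu_layers} -> {fewer}"
    return None  # nothing left to give up


def launch_with_recovery(
    cfg: Config, launch: Callable[[Config], None]
) -> Tuple[Config, List[str]]:
    """Retry `launch` until it succeeds; return the final config and the trail.

    `launch` is assumed to return only once the server has served a token,
    and to raise OutOfMemory otherwise.
    """
    trail: List[str] = []
    while True:
        try:
            launch(cfg)
            return cfg, trail
        except OutOfMemory:
            step = back_off(cfg)
            if step is None:
                raise  # every axis is exhausted; surface the error
            cfg, note = step
            trail.append(note)


# Usage with a fake launcher that only "fits" at q8_0 and ctx <= 4096.
def fake_launch(cfg: Config) -> None:
    if cfg.kv_type != "q8_0" or cfg.ctx > 4096:
        raise OutOfMemory()


final, trail = launch_with_recovery(Config(32, 8192, "f16"), fake_launch)
print(final)
print(trail)
```

In this run the loop gives up KV-cache precision first and halves the context once, and the returned trail records both steps so the trade-off can be reported rather than hidden.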

Honest scope

No cookies

This site sets no cookies and loads nothing from third parties. Like the tool itself: local, nothing transmitted.