Documentation

VRAMPilot installs as a Python package and needs nothing else: no llama.cpp setup, no manual download of a llama-server binary. This page covers installation, how the recovery ladder works, what is persisted, how the watchdog behaves, and the requirements.

Install

pipx install vrampilot          # or: pip install vrampilot

Run it on a model:

vrampilot your-model.gguf --ctx 8192      # CLI: plan -> launch -> recover -> serve
vrampilot-web                             # web UI: paste a .gguf path -> Launch -> chat

The web UI listens on http://127.0.0.1:8770. Both entry points end with an OpenAI-compatible endpoint you can point any client at. Module form also works: python -m vrampilot.cli / python -m vrampilot.web.

How the recovery ladder works

When the server fails to come up (or boots but cannot generate a token), VRAMPilot reads the error log, classifies it as out-of-memory, backs off one step, and retries. The order is designed to preserve the most capability:

  1. KV-cache precision first (f16 -> q8_0 -> q4_0). This shrinks the KV cache while keeping your full context. It is lossy on long reasoning — the report says so.
  2. Then shrink context — halve until it fits.
  3. Then offload more to CPU — more MoE expert-offload, or fewer GPU layers for dense models.
  4. Floor — if nothing fits, it says so honestly instead of pretending.

Validated example (raw trail in validation/RESULTS.md): given a deliberately impossible configuration at 262144 context, it recovered to 131072 context — 4x more than a context-only back-off would have kept — and served a real reply.

The OOM detector matches error strings across CUDA, Vulkan, ROCm and Metal.

Persistence

Every configuration that actually booted is persisted per (machine fingerprint, model header hash, requested context) in a local SQLite database at ~/.governor/configs.db (the engine keeps the governor name on purpose). It is append-only: a configuration that stops working is kept as history, never overwritten. The next launch boots the known-good configuration directly; a driver or GPU change invalidates the entry and triggers a replan.

Why it matters: a failed attempt costs a full model load — measured at ~220 s per attempt for a 9.5 GB MoE model loaded with --no-mmap on the gate machine's disk. The cache exists so you never pay that twice.

Transparency controls:

vrampilot configs list        # inspect everything it remembers
vrampilot model.gguf --no-cache   # bypass the cache entirely, force a replan

Nothing leaves your machine.

Watchdog behavior

While the server runs, the watchdog covers VRAM exhaustion during a generation — for example, you open another GPU application mid-stream.

Known limit: llama.cpp cannot shrink a live server's KV cache, so the only runtime action is this controlled restart. If upstream lands hot KV resize, VRAMPilot will adopt it and retire the restart for that path.

Requirements

Python 3.9+ is the one declared prerequisite of the pip/pipx install — everything else (including the inference engine binary) is fetched and verified on first run.