Preview · honest numbers, not a benchmark
Local AI, unstuck

Run big AI models on a small GPU.
When it runs out of memory, it recovers — it doesn't crash.

Point it at a model. It fits it to your GPU, and when the memory overflows it backs off and keeps running instead of dying. Works on NVIDIA, AMD and Apple.

▸ Try it on your machine
pick a GPU + a model
Your GPU
The model you want to run
VRAM used

Run your first model — free

100% local · no account · no tracking · your conversations never leave your computer
The wall
Everyone hits the same error.
The first time you run a local model, it crashes with CUDA out of memory — or silently crawls 20× slower. You're left guessing which quant and context fit. VRAMPilot does the guessing, and recovers when the guess is wrong.
How it works
Profile → plan → recover → serve.
01

Reads your GPU

Real VRAM, right now — NVIDIA, AMD or Apple. Not a static guess.

02

Plans a fit

Picks the quant, KV-cache precision and context that should fit your card.

03

Recovers from OOM

If it still overflows: KV-quant → shrink context → offload layers, and retries until it boots and generates.

04

Serves + tells you

An OpenAI-compatible endpoint, plus an honest note on what it traded off to fit.

Not a simulation
A real run, measured.
The actual output recovering an impossible config on a real RTX 3070 — KV-quant, then context, until it boots and serves.
vrampilot · RTX 3070 · CUDA
$ vrampilot serve Qwen2.5-7B-Q4.gguf
  GPU: NVIDIA RTX 3070 · 7.7 / 8.0 GB free
  plan: -ngl 99  -c 262144
  ✕ CUDA out of memory
   KV cache → q8_0  (keep context)   ✕ out of memory
   KV cache → q4_0  (keep context)   ✕ out of memory
   context  262144 → 131072          ✓ booted
  ✓ serving · http://127.0.0.1:8091/v1 · reply: "OK"
  # recovered instead of crashing — measured, not staged
Any GPU
Tested on real hardware. All three.
Runs and serves a real completion on each — the OOM-recovery works across CUDA, Vulkan and Metal.
NVIDIA
✓ VALIDATED
RTX 3070 · CUDA — recovered & served
AMD
✓ VALIDATED
Radeon 780M · Vulkan — recovered & served
Apple
✓ RUNS & SERVES
M1 · Metal — runs in browser-class hardware
No hype
What it is — and isn't.

It is

  • The only local tool that recovers from out-of-memory at runtime
  • Auto-picks the quant + context that fit your GPU
  • A clean layer on top of llama.cpp · Win / Linux / macOS

It isn't

  • A new inference engine — llama.cpp does the running
  • Magic — a model truly too big for your GPU is told so, plainly
  • Sending anything anywhere — it's 100% local
Who builds this
About ZMLabs

Independent deep-engineering lab based in Sète, France. We build practical AI systems, automation platforms and next-generation software — focused on real-world reliability.

zmlabs.ai →