Point it at a model. It fits it to your GPU, and when the memory overflows it backs off and keeps running instead of dying. Works on NVIDIA, AMD and Apple.
Run your first model — free
Real VRAM, right now — NVIDIA, AMD or Apple. Not a static guess.
Picks the quant, KV-cache precision and context that should fit your card.
If it still overflows: KV-quant → shrink context → offload layers, and retries until it boots and generates.
An OpenAI-compatible endpoint, plus an honest note on what it traded off to fit.
$ vrampilot serve Qwen2.5-7B-Q4.gguf GPU: NVIDIA RTX 3070 · 7.7 / 8.0 GB free plan: -ngl 99 -c 262144 ✕ CUDA out of memory ↳ KV cache → q8_0 (keep context) ✕ out of memory ↳ KV cache → q4_0 (keep context) ✕ out of memory ↳ context 262144 → 131072 ✓ booted ✓ serving · http://127.0.0.1:8091/v1 · reply: "OK" # recovered instead of crashing — measured, not staged
Independent deep-engineering lab based in Sète, France. We build practical AI systems, automation platforms and next-generation software — focused on real-world reliability.
zmlabs.ai →