
The Single-GPU Inference Stack in 2026

#rust#llm#inference#rtx-5090#llama-cpp#rookery

Running a 35B model at 213 tokens per second on a single consumer GPU, managed by an open-source daemon that handles crash recovery, model hot-swap, and agent lifecycle. Here's the full stack.

The Hardware

RTX 5090. 32GB VRAM, 1,792 GB/s memory bandwidth, Blackwell architecture. It's the current ceiling for consumer inference — 78% more bandwidth than the 4090 and 8GB more VRAM.

That bandwidth number is what matters for token generation. LLM inference is memory-bandwidth bound: every generated token requires reading the active model weights from VRAM. The rough ceiling is bandwidth / bytes_read_per_token = tokens_per_second; real throughput lands below it, but the scaling holds. More bandwidth, more tokens.
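That ceiling is easy to sanity-check. A minimal sketch, using the numbers from this post; the ~2.2 GB read per token for the MoE is my own assumption (roughly 3B active parameters at ~6 bits/weight), not a measured figure:

```rust
// Back-of-envelope ceiling for token generation: each token reads the
// active weights from VRAM once, so bandwidth / bytes-read bounds
// tokens/second. Real throughput is lower (KV-cache reads, attention,
// kernel launch overhead).
fn tg_ceiling(bandwidth_gb_s: f64, weights_read_gb: f64) -> f64 {
    bandwidth_gb_s / weights_read_gb
}

fn main() {
    // Dense 27B at Q6 (~25.7 GB, all weights touched per token).
    println!("dense ceiling: ~{:.0} tok/s", tg_ceiling(1792.0, 25.7)); // ~70
    // MoE with ~3B active params; ~2.2 GB read/token is assumed, not measured.
    println!("moe ceiling:   ~{:.0} tok/s", tg_ceiling(1792.0, 2.2));
}
```

The dense ceiling of ~70 tok/s against a measured 52 tok/s shows the model running at roughly 75% of the theoretical bandwidth limit, which is about as good as it gets.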

The rest of the system: Threadripper 7970X (32 cores), 128GB RAM, Ubuntu 24.04. The CPU barely matters with full GPU offload — 4 threads for inference, 24 for batch processing.

The Model: Why MoE is the Single-GPU Sweet Spot

Qwen3.5-35B-A3B. 35 billion total parameters, but only 3 billion active per token. It's a Mixture of Experts architecture — 256 experts with 8 active per token, plus Mamba2 SSM layers for efficiency.

This is why it's fast. The GPU reads ~3B parameters per token, not 35B, but you get 35B-quality outputs. The tradeoff: all 35B parameters must be loaded into VRAM even though only a fraction fire.

At UD-Q5_K_XL quantization (24.6GB on disk), the model loads with room for 262K tokens of context using q8_0 KV cache quantization. That's the full native context window with near-lossless quality.

The numbers (llama-bench, CUDA 12.8):

  • Token generation: 196 tok/s (llama-bench tg128), ~213 tok/s in-server with warm cache
  • Prompt processing: 6,670 tok/s at pp512, sustained across pp2048 and pp8192
  • VRAM: 31.8GB used, 0.8GB headroom

I also run a dense 27B model (Qwen3.5-27B at Q6_K_XL) at 52 tok/s gen and 3,415 tok/s PP — it's the production workhorse for my AI agent because dense models have more reliable structured output. More on that below.

The MoE model is faster than most people's 7B inference on a 3060.

The Inference Engine: llama.cpp

llama.cpp b8580, built with CUDA 12.8 and native Blackwell sm_120a support. No workarounds, no FORCE_CUBLAS — native kernel support for Blackwell landed in b8196 and has been refined since. Avoid CUDA 13.x — there's a known nvcc compiler bug that generates broken MMQ kernels for Blackwell.

The key optimization in b8579: a rewritten MoE GEMV kernel that uses warp-level reduction instead of shared memory synchronization. This single change delivered a 25% generation speed improvement — from 170 tok/s to 213 tok/s on the same model and hardware.

Flash attention is required for KV cache quantization. Without it, q8_0 and q4_0 KV cache types aren't available, and you'll burn 2-4x more VRAM on context.
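The 2-4x figure falls out of the cache-cell sizes. Here's a sketch of the arithmetic for a standard transformer attention layer; the layer/head counts below are illustrative placeholders, not the real Qwen3.5 config (which also mixes in Mamba2 layers that don't keep a conventional KV cache), but the per-element byte costs are llama.cpp's actual block formats:

```rust
// KV-cache footprint for standard transformer attention: two tensors
// (K and V) per layer, each n_kv_heads * head_dim wide, one entry per
// context token. Layer/head counts are illustrative, not the real
// Qwen3.5 hybrid config.
fn kv_cache_gb(layers: f64, kv_heads: f64, head_dim: f64,
               ctx_tokens: f64, bytes_per_elem: f64) -> f64 {
    2.0 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9
}

fn main() {
    let (l, h, d, ctx) = (12.0, 4.0, 128.0, 262_144.0);
    // llama.cpp block sizes: f16 = 2 B/elem, q8_0 = 34 B per 32 elems,
    // q4_0 = 18 B per 32 elems.
    for (name, b) in [("f16", 2.0), ("q8_0", 34.0 / 32.0), ("q4_0", 18.0 / 32.0)] {
        println!("{name}: {:.1} GB", kv_cache_gb(l, h, d, ctx, b));
    }
}
```

The ratios are what matter: f16 costs about 1.9x q8_0 and 3.6x q4_0 per cached element, which is exactly the 2-4x context-VRAM penalty for running without flash attention.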

The Management Layer: Rookery

Running a model is easy. Keeping it running 24/7 is the actual problem.

llama-server doesn't manage itself. It doesn't restart after CUDA errors. It doesn't tell you when it's silently broken (responding to health checks but unable to generate). It doesn't manage the agent processes that depend on it.

Rookery is what I built to solve this. It's a Rust daemon + CLI + embedded web dashboard that manages the full inference lifecycle:

Hot-swap without downtime. Four model profiles configured — MoE fast, MoE thinking, dense, and Nemotron. rookery swap qwen_thinking drains in-flight requests, stops the old model, loads the new one, health-checks it, and restarts dependent agents. Takes about 10 seconds.
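The step that makes swaps safe is the drain. A minimal sketch of just that step, with illustrative names (this is not Rookery's actual API): poll an in-flight request counter until it hits zero or a deadline passes, so the old server is never killed mid-response.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

// Block until all in-flight requests complete, or give up at the
// deadline. Returns true if fully drained; on false the caller
// decides whether to abort the swap or force-stop anyway.
fn drain(in_flight: &AtomicUsize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while in_flight.load(Ordering::Acquire) > 0 {
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    true
}
```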

Inference canary. Every 60 seconds, the daemon sends a 1-token completion request to verify the CUDA pipeline is functional. Not just a health check — an actual inference request. If it fails twice, the server is automatically restarted. It also watches llama-server's stderr for CUDA error strings and triggers an immediate canary on detection.
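The decision logic is small enough to show. A sketch of the fail-twice-then-restart policy described above; the actual probe (a 1-token completion request over HTTP) is out of scope here, and the struct is illustrative rather than Rookery's real code:

```rust
// Tracks consecutive canary failures; a restart is due only after
// two failed probes in a row, so one transient timeout never
// bounces a healthy server.
#[derive(Default)]
struct Canary {
    consecutive_failures: u32,
}

impl Canary {
    /// Record one probe result; returns true when a restart is due.
    fn record(&mut self, probe_ok: bool) -> bool {
        if probe_ok {
            self.consecutive_failures = 0;
            return false;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures >= 2 {
            self.consecutive_failures = 0; // reset once the restart fires
            true
        } else {
            false
        }
    }
}
```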

Agent watchdog. I run Hermes — Nous Research's multi-platform AI agent with tool calling, web browsing, vision, and voice — connected to the local inference API. It supports Telegram, Discord, Slack, WhatsApp, and more from a single gateway process. Rookery manages it as a supervised agent — starts it on boot, restarts it on crash with exponential backoff, bounces it when the inference server restarts (clearing stale connections), and watches stderr for fatal error patterns. Voice messages are transcribed locally via faster-whisper on CPU without touching the GPU.

Auto-sleep. With idle_timeout = 1800, the model unloads after 30 minutes of no inference traffic. The next API request wakes it transparently. Power draw drops from ~340W under active inference to ~140W idle.

GPU monitoring. Real-time VRAM, temperature, utilization, power draw, and per-process memory usage via NVML. Exposed as Prometheus metrics at /metrics and as live gauges in the dashboard.
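Serving those gauges is mostly string formatting once NVML hands you the numbers. A sketch of the /metrics side in the Prometheus text exposition format; the metric names, struct, and values here are made up for illustration:

```rust
// GPU stats as a /metrics endpoint would render them, in the
// Prometheus text exposition format (one "# TYPE" line per metric,
// then "name value"). Values would come from NVML in practice.
struct GpuStats {
    vram_used_bytes: u64,
    temperature_c: u32,
    power_draw_w: f64,
}

fn render_metrics(s: &GpuStats) -> String {
    let mut out = String::new();
    out.push_str("# TYPE gpu_vram_used_bytes gauge\n");
    out.push_str(&format!("gpu_vram_used_bytes {}\n", s.vram_used_bytes));
    out.push_str("# TYPE gpu_temperature_celsius gauge\n");
    out.push_str(&format!("gpu_temperature_celsius {}\n", s.temperature_c));
    out.push_str("# TYPE gpu_power_draw_watts gauge\n");
    out.push_str(&format!("gpu_power_draw_watts {}\n", s.power_draw_w));
    out
}
```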

The Dashboard

Seven tabs: Overview (GPU gauges, server status, agent summary), Settings (profile switcher, sampling params), Agents (watchdog state, dependency ports, filtered logs), Chat (streaming playground with abort), Bench (PP + gen speed), Logs (live viewer), Models (HuggingFace search, VRAM-aware recommendations, one-click download).

It's a Leptos WASM frontend embedded directly in the daemon binary. One binary, zero external dependencies for the UI.

The Reliability Story

I chaos-tested the system before open-sourcing it. The results:

Kill llama-server with SIGKILL: Canary detects the failure within 60 seconds, restarts the server with a new PID, bounces Hermes for a fresh connection. Fully recovered in ~70 seconds.

Kill the agent: Watchdog detects the crash within 30 seconds, restarts with 1-second backoff, escalating to 60 seconds on repeated failures. Resets after 5 minutes of healthy uptime.
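That backoff schedule can be sketched in a few lines. A minimal version using the numbers above: start at 1 second, double per consecutive crash, cap at 60 seconds, reset after 5 minutes of healthy uptime. Names are illustrative, not Rookery's actual fields:

```rust
use std::time::Duration;

// Restart backoff: 1, 2, 4, ..., capped at 60 seconds, with the
// crash counter cleared once the process stays up long enough.
struct Backoff {
    consecutive_crashes: u32,
}

impl Backoff {
    fn new() -> Self { Self { consecutive_crashes: 0 } }

    /// Delay before the next restart attempt.
    fn on_crash(&mut self) -> Duration {
        let secs = 1u64 << self.consecutive_crashes.min(6); // 64 hits the cap
        self.consecutive_crashes += 1;
        Duration::from_secs(secs.min(60))
    }

    /// Called periodically while the process is up.
    fn on_healthy(&mut self, uptime: Duration) {
        if uptime >= Duration::from_secs(300) {
            self.consecutive_crashes = 0;
        }
    }
}
```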

Rapid model swaps (4x in 30 seconds): All clean. Each swap drains, stops, starts, health-checks, and restarts agents. The operation mutex serializes them so they don't collide.

Daemon restart: Persisted state is reconciled on startup. Running llama-server processes are adopted (not restarted). Agent PIDs are adopted and bounced for fresh connections. Running vLLM containers are detected via Docker inspect.

Three race conditions were found and fixed during this testing:

  1. The canary could revert a model swap by using a stale profile name
  2. The dashboard could remove crashed agents from tracking before the watchdog detected them
  3. Agents could start before the inference server was ready on cold boot

All three are now fixed with regression tests. 341 tests total across the workspace.

The Dense Model: The Production Workhorse

The MoE model is fast, but my AI agent runs on the dense Qwen3.5-27B at Q6_K_XL (25.7GB). At 52 tok/s it's not as flashy, but it has measurably better structured output — Q6 dense scores 72.2 vs 67.3 for the MoE on BFCL-V4 tool calling benchmarks. When your agent is making function calls, parsing JSON, and chaining multi-step actions, that reliability gap matters more than raw speed.

The 27B dense is genuinely one of the best models in its class. Near-lossless Q6 quantization at 128K context, 3,415 tok/s prompt processing, and rock-solid tool calling. It handles voice transcription pipelines, web extraction, multi-turn conversations, and complex tool chains without breaking a sweat. For single-user agentic work, 52 tok/s is more than fast enough — most of the latency is in tool execution, not generation.

The VRAM math is tight but works: 25.7GB model + q4_0 KV cache at 128K context = 29.2GB used, 2.8GB free. Flash attention and KV cache quantization make it possible — without q4_0 KV, the Q6 model doesn't fit at any useful context length.

GPU Recommendations for Other Hardware

The principles scale to any NVIDIA GPU:

GPU        VRAM    Best MoE                   Best Dense    Gen tok/s (MoE / dense)
RTX 3060   12 GB   —                          8B Q5_K_M     ~50
RTX 3090   24 GB   Qwen3.5-35B-A3B Q4_K_XL    27B Q4_K_M    ~100 / ~35
RTX 4090   24 GB   Qwen3.5-35B-A3B Q4_K_XL    27B Q5_K_M    ~130 / ~45
RTX 5090   32 GB   Qwen3.5-35B-A3B Q5_K_XL    27B Q6_K_XL   ~213 / ~54

The RTX 3090 remains the best value in 2026: 24GB at $600-800 used. The 4060 Ti (16GB, 288 GB/s) is a trap: more VRAM than a 3060, but less memory bandwidth than one, and less than a third of a 3090's.

MoE models are the unlock for 24GB+ cards. Without them, you're capped at 27B dense models. With them, you get 35B-quality at 3B-speed.

The Stack

Hardware:    RTX 5090 (32GB, 1792 GB/s)
Model:       Qwen3.5-35B-A3B MoE (UD-Q5_K_XL, 24.6GB)
Engine:      llama.cpp b8580 (CUDA 12.8, sm_120a, flash attention)
Management:  Rookery (daemon + CLI + dashboard)
Agent:       Hermes (multi-platform AI agent, managed by Rookery watchdog)
Performance: 196 tok/s gen, 6,670 tok/s PP, 262K context, 24/7 uptime

Everything here is open source.

The total cost: one GPU, ~$12/month in power at idle, zero API fees. For single-user inference — AI agents, coding tools, chat across any platform — this stack replaces a cloud API subscription entirely.