Near-Lossless Inference at 178 tok/s: Gemma 4 MoE on RTX 5090
Google dropped Gemma 4 on April 2. By April 3, I had the MoE variant running at near-lossless quality on a single RTX 5090, with faster prompt processing than my previous daily driver even though that model sat at a lower quant level. The trick is the architecture.
The Model
Gemma 4 26B-A4B. 25.2 billion total parameters, 3.8 billion active per token. Mixture of Experts with 128 experts and 8 active, plus a shared expert that's always on. Apache 2.0 licensed. It ranks #6 on LMArena and retains 96-99% of the dense 31B model's quality while activating a fraction of the compute.
Standard Transformer MoE — no Mamba/SSM layers like the Qwen3.5 models I was running before. But it has something more interesting for single-GPU setups: a dual-attention architecture.
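To make the routing concrete, here's a toy sketch in Python. Only the expert counts match the description above; the dimensions, gating, and ReLU expert FFNs are placeholder assumptions, not Gemma 4's actual implementation.

```python
import numpy as np

# Toy MoE layer: 128 experts, top-8 routed per token, plus one always-on shared
# expert. Dimensions and activation are placeholders, not Gemma 4's real config.
N_EXPERTS, TOP_K, D_MODEL, D_FF = 128, 8, 128, 256
rng = np.random.default_rng(0)

router_w   = rng.standard_normal((D_MODEL, N_EXPERTS)).astype(np.float32) * 0.02
experts_w1 = rng.standard_normal((N_EXPERTS, D_MODEL, D_FF)).astype(np.float32) * 0.02
experts_w2 = rng.standard_normal((N_EXPERTS, D_FF, D_MODEL)).astype(np.float32) * 0.02
shared_w1  = rng.standard_normal((D_MODEL, D_FF)).astype(np.float32) * 0.02
shared_w2  = rng.standard_normal((D_FF, D_MODEL)).astype(np.float32) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Forward pass for a single token's activations x of shape (D_MODEL,)."""
    logits = x @ router_w                    # score all 128 experts
    top = np.argsort(logits)[-TOP_K:]        # keep only the 8 best
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # softmax over the active experts

    out = np.zeros_like(x)
    for weight, idx in zip(w, top):          # only 8 expert FFNs get evaluated
        h = np.maximum(x @ experts_w1[idx], 0.0)
        out += weight * (h @ experts_w2[idx])

    # The shared expert runs for every token, independent of the router.
    out += np.maximum(x @ shared_w1, 0.0) @ shared_w2
    return out

print(moe_layer(rng.standard_normal(D_MODEL).astype(np.float32)).shape)  # (128,)
```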
Why Q8_0 Fits on 32GB
Normally, Q8_0 of a 26B model would be about 26.9GB on disk. With a standard transformer, the KV cache at 200K context would add roughly another 12GB on top of that. That's pushing 39GB, way over the 5090's 32GB VRAM.
Gemma 4 has a trick. Of its 30 layers:
- 24 layers use sliding window attention (1024 tokens). These only store 1024 tokens of KV, regardless of how long the conversation is.
- 6 layers use full global attention (every 5th layer). Only these scale with context length.
The KV cache math at 200K context with q8_0:
- 24 sliding layers: ~100MB total (fixed)
- 6 global layers (2 KV heads, 512 head dim): ~2.3GB
- Total KV: ~2.4GB
Compare that to a standard transformer with 30 layers of full attention at the same dimensions — that'd be ~12GB of KV. Gemma 4's sliding window architecture cuts KV cache by 5x.
So: 26.9GB model + 2.4GB KV + 1.5GB CUDA overhead = 30.8GB. That's under 32GB with 1.2GB to spare.
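If you want to sanity-check that budget, here's the same arithmetic as a small Python calculation. It reuses the 2 KV heads and 512 head dim that were stated for the global layers on the sliding layers too (their head counts aren't given, which is why the sliding total lands under the ~100MB round figure above), and treats q8_0 KV as roughly one byte per element, so read the output as an estimate rather than a measurement.

```python
# Back-of-the-envelope VRAM budget for the setup above. All figures are estimates.
GiB = 1024 ** 3

def kv_gib(layers: int, tokens: int, kv_heads: int = 2, head_dim: int = 512) -> float:
    # K and V each store kv_heads * head_dim values per token, per layer,
    # at roughly one byte per element with a q8_0 KV cache.
    return layers * 2 * kv_heads * head_dim * tokens / GiB

global_kv  = kv_gib(layers=6,  tokens=200_000)  # full-attention layers grow with context
sliding_kv = kv_gib(layers=24, tokens=1024)     # capped at the 1024-token window
dense_kv   = kv_gib(layers=30, tokens=200_000)  # hypothetical: every layer full attention

model_gib, cuda_overhead_gib = 26.9, 1.5
total = model_gib + global_kv + sliding_kv + cuda_overhead_gib

print(f"global KV : {global_kv:.2f} GiB")          # ~2.3 GiB
print(f"sliding KV: {sliding_kv * 1024:.0f} MiB")  # ~48 MiB, fixed
print(f"all-global: {dense_kv:.1f} GiB")           # ~11.4 GiB with no sliding window
print(f"total     : {total:.1f} GiB of 32")        # ~30.7 GiB
```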
The Numbers
Gemma 4 26B-A4B Q8_0 (RTX 5090, llama.cpp b8660, CUDA 12.8):
| Metric | Value |
|---|---|
| Token generation | 178 tok/s (llama-bench tg128) |
| Prompt processing | 9,802 tok/s (pp512), sustained at pp2048/pp8192 |
| Context window | 200K tokens |
| KV cache | q8_0/q8_0 |
| VRAM used | 30.8GB / 32GB |
| Quality loss from BF16 | Negligible — Q8_0 is near-lossless |
Nearly 10,000 tokens per second on prompt processing. Sustained across all prompt lengths — 9,802 at pp512, 9,773 at pp2048, 9,589 at pp8192. The Q8_0 dequantization kernels on Blackwell are extremely efficient because the format is simple — no mixed-precision tensor splitting like the K-quants.
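For context on why the format is cheap to unpack, here's a minimal Python sketch of q8_0's storage layout (not llama.cpp's actual kernel code): blocks of 32 int8 values sharing one fp16 scale, about 1.06 bytes per weight, which is roughly where the 26.9GB on-disk figure comes from. Dequantization is a single multiply per element.

```python
import numpy as np

# Sketch of the q8_0 block format: every 32 weights share one fp16 scale and are
# stored as int8. Illustrative only; the real kernels live in llama.cpp's C/CUDA.
def quantize_q8_0(x: np.ndarray):
    blocks = x.reshape(-1, 32)
    scales = (np.abs(blocks).max(axis=1) / 127.0).astype(np.float16)
    q = np.round(blocks / scales[:, None].astype(np.float32)).astype(np.int8)
    return q, scales

def dequantize_q8_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # One multiply per element: int8 value times its block's scale.
    return (q.astype(np.float32) * scales[:, None].astype(np.float32)).ravel()

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_q8_0(w)
err = np.abs(dequantize_q8_0(q, s) - w).max()
print(f"max abs error: {err:.4f}")  # small, which is why q8_0 reads as near-lossless
```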
For comparison, here's how it stacks up against the Q6 version of the same model and what I was running before:
| Model | Quant | PP512 | TG128 | Context | Quality |
|---|---|---|---|---|---|
| Gemma 4 26B MoE | Q8_0 | 9,802 | 178 | 200K | Near-lossless |
| Gemma 4 26B MoE | UD-Q6_K_XL | — | 153 | 262K | Very good |
| Qwen3.5-35B MoE | UD-Q5_K_XL | 6,670 | 196 | 262K | Good |
| Qwen3.5-27B Dense | UD-Q6_K_XL | 3,415 | 52 | 131K | Very good |
The Q8_0 MoE is faster than the Q6_K_XL version of the same model (178 vs 153 tok/s gen) and blows away everything on prompt processing — 47% faster PP than Qwen's MoE, which was the previous champion. Q8_0 uses optimized dequantization kernels that run more efficiently on Blackwell than the mixed-precision tensor allocations in Dynamic quants. You're getting better quality AND better speed.
Qwen's MoE edges it on raw generation speed (196 tok/s) thanks to Mamba2 SSM layers being cheaper than attention. But Gemma's MoE at Q8 is near-lossless while Qwen is at Q5 — a significant quality gap. And Gemma's prompt processing advantage means it chews through long contexts almost 50% faster.
The Stack
The inference engine is llama.cpp b8660, built with CUDA 12.8 and native Blackwell sm_120a. Gemma 4 architecture support landed in PR #21309 on April 2 — day-one. I had to build from master since it wasn't in a tagged release yet.
The whole lifecycle is managed by Rookery, a Rust daemon that handles model hot-swap, health monitoring, and agent management. Swapping between Gemma 4 MoE Q8, Gemma 4 Dense, and Qwen profiles is one command: rookery swap gemma4_moe_q8. The daemon's new upstream release monitor checks llama.cpp for updates every 30 minutes so I don't miss architecture fixes — useful when you're chasing day-one model support.
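If you want the same behavior without Rookery, a minimal version of the release monitor looks something like this Python sketch: poll the GitHub releases API on a timer and compare tags. The repo path and tag handling here are my assumptions, not Rookery's actual code.

```python
import json
import time
import urllib.request

# Hypothetical sketch of an upstream release monitor (not Rookery's actual code):
# poll the llama.cpp releases feed and flag anything newer than the running build.
RELEASES_URL = "https://api.github.com/repos/ggml-org/llama.cpp/releases/latest"
CURRENT_TAG = "b8660"      # whatever tag or commit the local build came from
POLL_SECONDS = 30 * 60     # check every 30 minutes

def latest_tag() -> str:
    with urllib.request.urlopen(RELEASES_URL, timeout=10) as resp:
        return json.load(resp)["tag_name"]

while True:
    try:
        tag = latest_tag()
        if tag != CURRENT_TAG:
            print(f"new llama.cpp release: {tag} (currently on {CURRENT_TAG})")
    except OSError as err:
        print(f"release check failed: {err}")
    time.sleep(POLL_SECONDS)
```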
Hermes (Nous Research's AI agent) runs as a managed service under Rookery's watchdog, with tool calling and web search on the local inference server. The same model that generates 178 tok/s also handles function calling cleanly; Gemma 4 ships with native tool-use support.
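Poking at that from the outside is straightforward, since llama.cpp's server speaks the OpenAI chat-completions dialect. A minimal sketch of a tool-calling request, assuming the server is listening on localhost:8080; the model alias, port, and weather function are illustrative assumptions, not part of the setup described above.

```python
import json
import urllib.request

# Minimal tool-calling request against a local llama.cpp server's OpenAI-compatible
# endpoint. The port, model alias, and the weather tool are made-up examples.
payload = {
    "model": "gemma4-moe-q8",
    "messages": [{"role": "user", "content": "What's the weather in Berlin right now?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    message = json.load(resp)["choices"][0]["message"]

# If the model decided to call the tool, the arguments come back as a JSON string.
for call in message.get("tool_calls", []):
    print(call["function"]["name"], call["function"]["arguments"])
```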
The Takeaway
Sliding window attention is a game-changer for consumer GPUs. It decouples KV cache size from context length for the majority of layers, letting you run higher quants than the parameter count would normally allow. A 26.9GB Q8_0 model has no business fitting in 32GB at 200K context — but the architecture makes it work.
If you're on an RTX 5090 or any 32GB card, look for models with sliding window or hybrid attention. The KV cache savings are real and they compound: you can push quality up (Q8 instead of Q5/Q6) or context up (200K+ instead of 131K) or both, depending on what your workload needs.
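To put rough numbers on that trade-off, here's a small calculator reusing the head-count assumptions from earlier and ignoring the small fixed sliding-window cache; estimates only.

```python
# Rough "what fits" math for a 32GB card, reusing the assumptions above
# (2 KV heads, 512 head dim, ~1 byte/element q8_0 KV cache) and ignoring the
# small fixed sliding-window cache. Estimates, not guarantees.
GiB = 1024 ** 3
kv_budget = (32 - 26.9 - 1.5) * GiB  # VRAM left after Q8_0 weights + CUDA overhead

def max_context(global_layers: int, kv_heads: int = 2, head_dim: int = 512) -> int:
    # Only the global-attention layers' KV grows with context; K and V per layer.
    bytes_per_token = global_layers * 2 * kv_heads * head_dim
    return int(kv_budget / bytes_per_token)

print(f"hybrid (6 global layers): ~{max_context(6) // 1000}K tokens of headroom")   # ~314K
print(f"all-global (30 layers):   ~{max_context(30) // 1000}K tokens of headroom")  # ~62K
```

Under these assumptions the Q8_0 build still has KV headroom past 200K, while an all-global-attention model of the same size would cap out around 60K of context on the same card.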
The MoE + sliding window combination is the current sweet spot for single-GPU inference. Near-lossless quality, 178 tok/s, 200K context, on hardware you can buy at Micro Center. Not bad for day two of a model release.