Llama-4-Maverick-17B-128E CPU Inference: Q4_K_M vs Q8_0 Speed-Quality Trade-off Measured
Llama-4-Maverick (17B active / 128-expert MoE) CPU inference on EPYC 9175F, comparing Q4_K_M and Q8_0. Q4 delivers 21-24 tok/s, Q8 delivers 15-16 tok/s. Quantization selection criteria for MoE CPU inference on 768GB RAM.
Background
Llama-4-Maverick is Meta’s MoE model with 128 experts. Active parameters stay at ~17B, making CPU inference viable despite the large total parameter count.
On EPYC 9175F with 768GB DDR5, both Q4_K_M (~290GB) and Q8_0 (~426GB) fit in physical memory. The question is which to choose. This test compares both on identical hardware and tasks to establish per-use-case selection criteria.
Objective
- Compare Q4_K_M and Q8_0 Prefill/Decode speeds on identical hardware
- Quantify TTFT (Time To First Token) difference
- Evaluate output quality difference on practical tasks
- Establish selection criteria for batch processing vs aider workflows
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory | DDR5-6400 768GB (12ch) |
| GPU | Not used (CPU inference) |
| OS | Ubuntu 24.04 LTS |
| Runtime | llama.cpp server (Podman rootless) |
| Threads | 14 |
| Context | 16,384 |
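For later reference, the rig's theoretical peak memory bandwidth follows directly from the DIMM spec. A quick sketch (real sustained bandwidth is lower than this ceiling):

```python
# Theoretical peak bandwidth: DDR5-6400 = 6400 MT/s per channel,
# 64-bit (8-byte) transfers, 12 channels.
MT_PER_S = 6400e6       # transfers per second per channel
BYTES_PER_TRANSFER = 8  # 64-bit channel width
CHANNELS = 12

peak_gb_s = MT_PER_S * BYTES_PER_TRANSFER * CHANNELS / 1e9
print(f"theoretical peak: {peak_gb_s:.1f} GB/s")  # 614.4 GB/s
```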
Model Specifications
| Item | Q4_K_M | Q8_0 |
|---|---|---|
| Architecture | Llama-4-Maverick (MoE, 128E) | Same |
| Model Size | ~290GB | ~426GB |
| Quantization | 4-bit mixed | 8-bit |
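The file sizes are consistent with the formats' effective bits per weight. Q8_0 stores one fp16 scale per 32-weight block, i.e. 8.5 bits/weight; Q4_K_M mixes precisions, so its average sits above the nominal 4 bits. The ~400B total parameter count used below is an approximation:

```python
# Back-of-the-envelope bits-per-weight from GGUF file size.
# Assumes ~400B total parameters for Llama-4-Maverick (17B active, 128 experts).
TOTAL_PARAMS = 400e9

bits = {name: size_gb * 1e9 * 8 / TOTAL_PARAMS
        for name, size_gb in [("Q4_K_M", 290), ("Q8_0", 426)]}
for name, b in bits.items():
    print(f"{name}: ~{b:.1f} bits/weight")  # Q4_K_M ~5.8, Q8_0 ~8.5
```

The Q8_0 result (~8.5 bits) matches the format's block layout almost exactly, which suggests the 400B estimate is in the right range.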
Results
Speed Comparison
| Metric | Q4_K_M | Q8_0 | Delta |
|---|---|---|---|
| PP(tok/s) | 65-68 | 50-52 | Q4 ~30% faster |
| TG(tok/s) | 21-24 | 15-16 | Q4 ~40% faster |
| TTFT (800-1000tok) | 12-17s | 16-20s | Q4 3-5s faster |
Q4_K_M Measurements
| Prompt(tok) | PP time(s) | PP(tok/s) | Gen(tok) | TG(tok/s) |
|---|---|---|---|---|
| 819 | 12.0 | 68.4 | 165 | 23.7 |
| 1,101 | 16.8 | 65.4 | 814 | 21.3 |
Q8_0 Measurements
| Prompt(tok) | PP time(s) | PP(tok/s) | Gen(tok) | TG(tok/s) |
|---|---|---|---|---|
| 819 | 15.6 | 52.4 | 104 | 16.6 |
| 1,000 | 19.7 | 50.8 | 916 | 15.2 |
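The headline deltas in the summary table can be recomputed from the raw rows above. A quick sketch, averaging the two runs per quant:

```python
# Average PP/TG over the two measured runs for each quantization.
q4_pp, q8_pp = (68.4 + 65.4) / 2, (52.4 + 50.8) / 2
q4_tg, q8_tg = (23.7 + 21.3) / 2, (16.6 + 15.2) / 2

print(f"PP: Q4 {q4_pp / q8_pp - 1:+.0%}")  # +30%
print(f"TG: Q4 {q4_tg / q8_tg - 1:+.0%}")  # +42%

# TTFT is dominated by prefill: prompt_tokens / PP speed.
for prompt in (800, 1000):
    print(f"{prompt} tok: Q4 ~{prompt / q4_pp:.0f}s, Q8 ~{prompt / q8_pp:.0f}s")
```

The derived TTFT values (~12-15s for Q4, ~16-19s for Q8) line up with the measured 12-17s and 16-20s ranges.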
Memory Behavior
- Q4_K_M: mmap + page cache loads only needed portions. Prompt cache ~180-380 MiB per entry
- Q8_0: Hundreds of GB in buff/cache. RSS appears small (normal mmap behavior)
- MoE (128E): only active experts are frequently accessed; unused pages stay cold
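One way to watch this behavior during a run is to track the page cache rather than RSS. A minimal Linux-only sketch:

```python
# Read the page-cache size from /proc/meminfo to see how much of the
# mmap'd model is resident. Linux-only; field names are standard.
def cached_gib(meminfo_text: str) -> float:
    for line in meminfo_text.splitlines():
        if line.startswith("Cached:"):
            kb = int(line.split()[1])  # value is reported in kB
            return kb / (1024 ** 2)
    raise ValueError("Cached: field not found")

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(f"page cache: {cached_gib(f.read()):.1f} GiB")
```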
Analysis
Speed Gap Explained
Q4_K_M is ~68% of Q8_0’s model size (290GB vs 426GB). MoE inference reads selected expert weights from memory per token. Smaller quantization means less data per read, directly reducing memory bandwidth pressure.
The TG speed gap (~40%) exceeds the model-size gap (32%) because Decode's random access pattern amplifies cache-miss penalties and memory-latency effects on top of raw bandwidth.
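A rough traffic estimate supports this. Assuming decode reads the ~17B active-expert weights roughly once per token (bits/weight estimated from the file sizes), the implied effective bandwidth sits well below the ~614 GB/s theoretical peak of 12-channel DDR5-6400, leaving headroom that latency-bound stalls plausibly consume:

```python
# Implied memory traffic per decoded token, assuming decode is
# bandwidth-bound and touches the ~17B active parameters once per token.
ACTIVE_PARAMS = 17e9

eff = {}
for name, bits_per_weight, tok_s in [("Q4_K_M", 5.8, 22.5), ("Q8_0", 8.5, 15.5)]:
    gb_per_tok = ACTIVE_PARAMS * bits_per_weight / 8 / 1e9
    eff[name] = gb_per_tok * tok_s
    print(f"{name}: ~{gb_per_tok:.1f} GB/token -> ~{eff[name]:.0f} GB/s effective")
```

Both quants land near ~280 GB/s effective, under half the theoretical peak, consistent with decode being latency- as well as bandwidth-limited.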
Quality Difference in Practice
Q8_0 advantages:
- More stable context retention (less topic drift in long conversations)
- Fewer destructive edits (notably in aider code modifications)
- An overall sense of reliability
The gap is not dramatic. Code scaffolding and summarization work fine at Q4_K_M. The difference surfaces in complex repository operations and long-form generation.
Why Q8_0 Is Viable on 768GB RAM
Q8_0's 426GB is impractical on typical servers. On 768GB DDR5, the model, KV cache, and page cache all fit in physical memory, so "use Q8_0 when memory is abundant" becomes a rational choice specific to this environment.
Lessons Learned
Q4_K_M vs Q8_0 selection is driven by task nature, not a simple speed-quality trade-off:
- High-throughput oneshot generation, Dagster/dbt batch: Q4_K_M. 40% speed difference compounds over volume
- Aider workflows, complex repo operations, long-form: Q8_0. Reduced destructive edit risk outweighs speed loss
- Daily on/off operation: Q4_K_M. Faster startup
- Always-on resident: Q8_0. Quality stability pays off over long sessions
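The criteria above can be condensed into a small helper. The task labels are illustrative stand-ins, not an API of any tool mentioned in this post:

```python
# Encode the selection criteria: batch throughput favors Q4_K_M,
# edit-heavy or long-form work favors Q8_0.
def pick_quant(task: str, always_on: bool = False) -> str:
    batch_tasks = {"oneshot", "dagster", "dbt", "summarization"}
    quality_tasks = {"aider", "repo-edit", "longform"}
    if task in batch_tasks:
        return "Q4_K_M"   # ~40% decode speedup compounds over volume
    if task in quality_tasks:
        return "Q8_0"     # fewer destructive edits outweighs the speed loss
    # Otherwise: resident servers favor quality, on/off use favors startup speed.
    return "Q8_0" if always_on else "Q4_K_M"

print(pick_quant("dbt"))    # Q4_K_M
print(pick_quant("aider"))  # Q8_0
```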
MoE CPU inference uniquely makes both quantization levels practical. The 128-expert locality pattern enables efficient memory bandwidth utilization, keeping Q8_0 at a usable 15-16 tok/s.
Reproduction Steps
Q4_K_M
```shell
# $MO = host directory containing the GGUF files; $IMG = llama.cpp server image
podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 \
  -m /models/Llama-4-Maverick-17B-128E-Instruct-Q4_K_M.gguf \
  --jinja -c 16384 \
  --threads 14 --threads-batch 14 \
  -b 1024 -ub 256 \
  --parallel 1 --flash-attn on
```
Q8_0
Same command, substitute Q8_0 model file.
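To collect PP/TG numbers without scraping server logs, the llama.cpp server's native /completion endpoint returns a timings object. A minimal client sketch; the timing field names match recent llama.cpp builds and may differ in older versions:

```python
import json
import urllib.request

def build_request(prompt: str, n_predict: int = 128,
                  url: str = "http://localhost:8081/completion"):
    # Native llama.cpp server completion request.
    data = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})

def measure(prompt: str, n_predict: int = 128):
    # Returns (prompt tok/s, generation tok/s) from the server's timings.
    with urllib.request.urlopen(build_request(prompt, n_predict)) as resp:
        t = json.load(resp)["timings"]
    return t["prompt_per_second"], t["predicted_per_second"]

# Usage against a running server:
#   pp, tg = measure("Summarize: memory bandwidth limits MoE decode speed.")
```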
Technical Notes
MoE mmap Behavior
Only a few of 128 experts are active per token. mmap page cache “warms” unevenly. Sustained single-topic processing hits the same experts repeatedly, improving cache hit rate. Frequent topic changes increase page faults.
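A toy simulation (not the real router) illustrates why single-topic workloads keep the cache warm: when routing is skewed toward a small hot set of experts, most accesses land on already-resident pages.

```python
# Toy model of expert-access locality: per-token expert choice drawn
# from a topic-skewed distribution over 128 experts.
import random

random.seed(0)
N_EXPERTS, TOKENS = 128, 2000

# A "topic" concentrates probability mass on a small hot set of experts.
hot = set(random.sample(range(N_EXPERTS), 8))
weights = [10.0 if e in hot else 1.0 for e in range(N_EXPERTS)]

picks = random.choices(range(N_EXPERTS), weights=weights, k=TOKENS)
hot_hits = sum(p in hot for p in picks)
print(f"{hot_hits / TOKENS:.0%} of accesses hit the 8-expert hot set")
```

With this skew, roughly 40% of accesses hit 6% of the experts; a topic change replaces the hot set and triggers a burst of page faults.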
Prompt Cache
Longest-common-prefix (LCP) cache matching works on both Q4_K_M and Q8_0. With a fixed system prompt, TTFT on subsequent requests drops significantly.
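Conceptually, the matching logic looks like the sketch below (simplified; the actual llama.cpp implementation manages KV-cache reuse at a lower level): a new request reuses cached prefill up to the first diverging token.

```python
# Longest common prefix between a cached token sequence and a new request;
# tokens up to this length skip prefill entirely.
def lcp_len(cached: list[int], incoming: list[int]) -> int:
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

system = list(range(900))          # stand-in for a fixed system prompt
req1 = system + [1001, 1002]
req2 = system + [2001, 2002, 2003]
print(lcp_len(req1, req2))         # 900 tokens of prefill skipped
```

This is why a fixed system prompt matters: it maximizes the shared prefix across requests.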

