Background

Llama-4-Maverick is Meta’s MoE model with 128 experts. Active parameters stay at ~17B, making CPU inference viable despite the large total parameter count.

On EPYC 9175F with 768GB DDR5, both Q4_K_M (~290GB) and Q8_0 (~426GB) fit in physical memory. The question is which to choose. This test compares both on identical hardware and tasks to establish per-use-case selection criteria.

Objective

  1. Compare Q4_K_M and Q8_0 Prefill/Decode speeds on identical hardware
  2. Quantify TTFT (Time To First Token) difference
  3. Evaluate output quality difference on practical tasks
  4. Establish selection criteria for batch processing vs aider workflows

Test Environment

| Item    | Specification                         |
|---------|---------------------------------------|
| CPU     | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory  | DDR5-6400 768GB (12ch)                |
| GPU     | Not used (CPU inference)              |
| OS      | Ubuntu 24.04 LTS                      |
| Runtime | llama.cpp server (Podman rootless)    |
| Threads | 14                                    |
| Context | 16,384                                |

Model Specifications

| Item         | Q4_K_M                       | Q8_0   |
|--------------|------------------------------|--------|
| Architecture | Llama-4-Maverick (MoE, 128E) | Same   |
| Model Size   | ~290GB                       | ~426GB |
| Quantization | 4-bit mixed                  | 8-bit  |

Results

Speed Comparison

| Metric               | Q4_K_M | Q8_0   | Delta          |
|----------------------|--------|--------|----------------|
| PP (tok/s)           | 65-68  | 50-52  | Q4 ~30% faster |
| TG (tok/s)           | 21-24  | 15-16  | Q4 ~40% faster |
| TTFT (800-1,000 tok) | 12-17s | 16-20s | Q4 3-5s faster |

(PP = prompt processing / Prefill; TG = token generation / Decode.)

Q4_K_M Measurements

| Prompt (tok) | PP time (s) | PP (tok/s) | Gen (tok) | TG (tok/s) |
|--------------|-------------|------------|-----------|------------|
| 819          | 12.0        | 68.4       | 165       | 23.7       |
| 1,101        | 16.8        | 65.4       | 814       | 21.3       |

Q8_0 Measurements

| Prompt (tok) | PP time (s) | PP (tok/s) | Gen (tok) | TG (tok/s) |
|--------------|-------------|------------|-----------|------------|
| 819          | 15.6        | 52.4       | 104       | 16.6       |
| 1,000        | 19.7        | 50.8       | 916       | 15.2       |

Memory Behavior

  • Q4_K_M: mmap + page cache loads only needed portions. Prompt cache ~180-380 MiB per entry
  • Q8_0: Hundreds of GB in buff/cache. RSS appears small (normal mmap behavior)
  • MoE (128E): only active experts are frequently accessed; unused pages stay cold

Analysis

Speed Gap Explained

Q4_K_M is ~68% of Q8_0’s size (290GB vs 426GB). MoE inference streams the selected experts’ weights from memory for every token, so a smaller quantized footprint means fewer bytes read per token and less memory bandwidth pressure.

The TG speed gap (~40%) exceeds the model size gap (~32%) because Decode’s access pattern is more random: per-token expert selection scatters reads across the weight file, amplifying cache-miss penalties and memory latency on top of raw bandwidth cost.
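As a sanity check, the bandwidth-bound decode ceiling can be sketched numerically. The ~4.5 and ~8.5 bits-per-weight averages and the ~17B active-parameter count are assumptions for illustration, not measured values:

```python
# Back-of-the-envelope decode ceiling from memory bandwidth alone.
# Assumptions: ~17B active params/token; ~4.5 bits/weight average for
# Q4_K_M and ~8.5 for Q8_0 (including quantization metadata overhead).

ACTIVE_PARAMS = 17e9
DDR5_6400_GBPS = 6400e6 * 8 * 12 / 1e9  # 12 channels x 8 bytes: ~614 GB/s peak

def decode_ceiling(bits_per_weight: float) -> float:
    """Max tok/s if decode were purely memory-bandwidth-bound."""
    gb_per_token = ACTIVE_PARAMS * bits_per_weight / 8 / 1e9
    return DDR5_6400_GBPS / gb_per_token

q4 = decode_ceiling(4.5)  # ~64 tok/s theoretical
q8 = decode_ceiling(8.5)  # ~34 tok/s theoretical
print(f"Q4 ceiling {q4:.0f} tok/s, Q8 ceiling {q8:.0f} tok/s, ratio {q4 / q8:.2f}")
```

Both measured TG figures sit well below these ceilings, and the measured gap (~1.45x) is smaller than the ceiling ratio (~1.89x), consistent with latency and cache-miss costs that hit both quantizations roughly equally.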

Quality Difference in Practice

Q8_0 advantages:

  • More stable context retention (less topic drift in long conversations)
  • Fewer destructive edits (notably in aider code modifications)
  • An overall impression of greater reliability

The gap is not dramatic. Code scaffolding and summarization work fine at Q4_K_M. The difference surfaces in complex repository operations and long-form generation.

Why Q8_0 Is Viable on 768GB RAM

Q8_0’s 426GB is impractical on typical servers. On 768GB DDR5, the model + KV cache + page cache all fit in memory. “Use Q8_0 when memory is abundant” is a rational choice specific to this environment.
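A rough budget makes this concrete. The KV-cache dimensions below are hypothetical placeholders, not Llama-4-Maverick’s published geometry; the point is only that at 16K context the KV cache is small next to the weights:

```python
# Memory budget sketch for Q8_0 on a 768GB host.
# Layer/head dimensions are ILLUSTRATIVE ASSUMPTIONS, not the model's
# actual architecture; they only show the order of magnitude.

MODEL_GB = 426
CTX = 16_384
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES = 48, 8, 128, 2  # f16 K/V (assumed)

# 2x for K and V, per layer, per KV head, per head dim, per context slot.
kv_gb = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * CTX / 1e9
budget_gb = MODEL_GB + kv_gb
print(f"KV cache ~{kv_gb:.1f} GB, total ~{budget_gb:.0f} GB of 768 GB")
```

Even with generous dimensions the KV cache stays in the single-digit GB range, leaving ample headroom for page cache on a 768GB host.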

Lessons Learned

Q4_K_M vs Q8_0 selection is driven by task nature, not a simple speed-quality trade-off:

  • High-throughput oneshot generation, Dagster/dbt batch: Q4_K_M. 40% speed difference compounds over volume
  • Aider workflows, complex repo operations, long-form: Q8_0. Reduced destructive edit risk outweighs speed loss
  • Daily on/off operation: Q4_K_M. Faster startup
  • Always-on resident: Q8_0. Quality stability pays off over long sessions
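The compounding effect in the batch case is easy to quantify using this test’s measured mid-range speeds (the 1,000-job, 800-token batch is a hypothetical workload):

```python
# How the TG gap compounds over a batch of generations.
Q4_TG, Q8_TG = 22.5, 15.5          # tok/s, mid-range of measured values
JOBS, TOKENS_PER_JOB = 1_000, 800  # hypothetical batch workload

q4_hours = JOBS * TOKENS_PER_JOB / Q4_TG / 3600
q8_hours = JOBS * TOKENS_PER_JOB / Q8_TG / 3600
print(f"Q4_K_M ~{q4_hours:.1f} h, Q8_0 ~{q8_hours:.1f} h, "
      f"saved ~{q8_hours - q4_hours:.1f} h")
```

A per-request difference of a few tok/s becomes several hours per batch, which is why throughput-oriented pipelines favor Q4_K_M.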

MoE is what makes CPU inference practical at either quantization level: with 128 experts, only a small active subset is read per token, and that access locality keeps memory bandwidth utilization efficient enough to hold Q8_0 at a usable 15-16 tok/s.

Reproduction Steps

Q4_K_M

  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 \
  -m /models/Llama-4-Maverick-17B-128E-Instruct-Q4_K_M.gguf \
  --jinja -c 16384 \
  --threads 14 --threads-batch 14 \
  -b 1024 -ub 256 \
  --parallel 1 --flash-attn on
  

Q8_0

Same command, substitute Q8_0 model file.
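The PP/TG/TTFT numbers above can be collected from the server’s response timings. The following is a sketch assuming the server started above is listening on localhost:8081, using the timings fields llama.cpp’s /completion endpoint returns:

```python
import json
import urllib.request

def summarize(timings: dict) -> dict:
    """Reduce llama.cpp /completion 'timings' to PP/TG tok/s and TTFT."""
    return {
        "pp_tok_s": timings["prompt_per_second"],
        "tg_tok_s": timings["predicted_per_second"],
        "ttft_s": timings["prompt_ms"] / 1000,
    }

def bench(url: str, prompt: str, n_predict: int = 256) -> dict:
    """POST a completion request and summarize its reported timings."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return summarize(json.load(resp)["timings"])

# Example (needs a running server):
# print(bench("http://localhost:8081/completion", "Explain MoE routing."))
```

Run the same prompts against both containers and compare the summaries directly.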

Technical Notes

MoE mmap Behavior

Only a few of 128 experts are active per token. mmap page cache “warms” unevenly. Sustained single-topic processing hits the same experts repeatedly, improving cache hit rate. Frequent topic changes increase page faults.
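A toy simulation shows why topic churn touches more cold pages. This is not llama.cpp’s actual router; the top-2 routing and the 16-expert “single topic” subset are illustrative assumptions:

```python
import random

# Toy model of MoE page-cache warmth: per token, top-k of the expert pool
# fires. A sustained single topic skews routing toward a small subset;
# frequent topic changes approach uniform routing over all 128 experts.
# Unique experts touched ~ distinct weight regions faulted into page cache.

def experts_touched(n_tokens: int, pool: int, k: int = 2, seed: int = 0) -> int:
    rng = random.Random(seed)
    touched = set()
    for _ in range(n_tokens):
        touched.update(rng.sample(range(pool), k))  # top-k expert pick (toy)
    return len(touched)

focused = experts_touched(1_000, pool=16)    # sustained single topic
churning = experts_touched(1_000, pool=128)  # frequent topic changes
print(f"focused workload touches {focused} experts, churning {churning}")
```

The focused workload keeps re-hitting the same warm pages, while the churning one eventually faults in nearly the whole expert set, matching the observed behavior.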

Prompt Cache

Longest-common-prefix (LCP) similarity matching works on both Q4_K_M and Q8_0. With a fixed System Prompt, TTFT on subsequent requests drops significantly because the shared prefix is served from cache instead of being re-prefilled.
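A minimal sketch of the mechanism (token IDs below are illustrative):

```python
# The cache hit is governed by the longest common prefix (LCP) between
# the new request's token sequence and a cached entry's token sequence.

def lcp_len(a: list[int], b: list[int]) -> int:
    """Number of leading token IDs two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = list(range(700))          # ~700-token fixed System Prompt
cached = system + [900, 901, 902]  # tokens of a previous request
new = system + [950, 951]          # new request, same System Prompt
print(lcp_len(cached, new))        # the whole 700-token prefix is reused
```

At Q8_0’s measured ~52 tok/s prefill, reusing a 700-token prefix saves roughly 13 seconds of TTFT on each subsequent request.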