Background

Llama-4-Maverick is Meta’s MoE model with 128 experts. Active parameters stay at ~17B, making CPU inference viable despite the large total parameter count.

On EPYC 9175F with 768GB DDR5, both Q4_K_M (~290GB) and Q8_0 (~426GB) fit in physical memory. The question is which to choose. This test compares both on identical hardware and tasks to establish per-use-case selection criteria.

Objective

  1. Compare Q4_K_M and Q8_0 Prefill/Decode speeds on identical hardware
  2. Quantify TTFT (Time To First Token) difference
  3. Evaluate output quality difference on practical tasks
  4. Establish selection criteria for batch processing vs aider workflows

Test Environment

| Item    | Specification                         |
|---------|---------------------------------------|
| CPU     | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory  | DDR5-6400 768GB (12ch)                |
| GPU     | Not used (CPU inference)              |
| OS      | Ubuntu 24.04 LTS                      |
| Runtime | llama.cpp server (Podman rootless)    |
| Threads | 14                                    |
| Context | 16,384                                |

Model Specifications

| Item         | Q4_K_M                       | Q8_0   |
|--------------|------------------------------|--------|
| Architecture | Llama-4-Maverick (MoE, 128E) | Same   |
| Model Size   | ~290GB                       | ~426GB |
| Quantization | 4-bit mixed                  | 8-bit  |

Results

Speed Comparison

| Metric               | Q4_K_M | Q8_0   | Delta          |
|----------------------|--------|--------|----------------|
| PP (tok/s)           | 65-68  | 50-52  | Q4 ~30% faster |
| TG (tok/s)           | 21-24  | 15-16  | Q4 ~40% faster |
| TTFT (800-1,000 tok) | 12-17s | 16-20s | Q4 3-5s faster |

(PP = prompt processing / Prefill; TG = token generation / Decode.)

Q4_K_M Measurements

| Prompt (tok) | PP time (s) | PP (tok/s) | Gen (tok) | TG (tok/s) |
|--------------|-------------|------------|-----------|------------|
| 819          | 12.0        | 68.4       | 165       | 23.7       |
| 1,101        | 16.8        | 65.4       | 814       | 21.3       |

Q8_0 Measurements

| Prompt (tok) | PP time (s) | PP (tok/s) | Gen (tok) | TG (tok/s) |
|--------------|-------------|------------|-----------|------------|
| 819          | 15.6        | 52.4       | 104       | 16.6       |
| 1,000        | 19.7        | 50.8       | 916       | 15.2       |

Memory Behavior

  • Q4_K_M: mmap + page cache loads only needed portions. Prompt cache ~180-380 MiB per entry
  • Q8_0: Hundreds of GB in buff/cache. RSS appears small (normal mmap behavior)
  • MoE (128E): only active experts are frequently accessed; unused pages stay cold

Analysis

Speed Gap Explained

Q4_K_M is ~68% of Q8_0’s size (290GB vs 426GB). MoE inference streams the selected experts’ weights from memory for every token, so a smaller quantized footprint means fewer bytes read per token and less memory bandwidth pressure.

The TG speed gap (~40%) exceeds the model size gap (~32%) because Decode’s access pattern is more random: per-token expert selection scatters reads across the weight file, amplifying cache-miss penalties and memory latency on top of raw bandwidth cost.
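As a sanity check, the bandwidth-bound decode ceiling can be sketched numerically. The ~4.5 and ~8.5 bits-per-weight averages and the ~17B active-parameter count are assumptions for illustration, not measured values:

```python
# Back-of-the-envelope decode ceiling from memory bandwidth alone.
# Assumptions: ~17B active params/token; ~4.5 bits/weight average for
# Q4_K_M and ~8.5 for Q8_0 (including quantization metadata overhead).

ACTIVE_PARAMS = 17e9
DDR5_6400_GBPS = 6400e6 * 8 * 12 / 1e9  # 12 channels x 8 bytes: ~614 GB/s peak

def decode_ceiling(bits_per_weight: float) -> float:
    """Max tok/s if decode were purely memory-bandwidth-bound."""
    gb_per_token = ACTIVE_PARAMS * bits_per_weight / 8 / 1e9
    return DDR5_6400_GBPS / gb_per_token

q4 = decode_ceiling(4.5)  # ~64 tok/s theoretical
q8 = decode_ceiling(8.5)  # ~34 tok/s theoretical
print(f"Q4 ceiling {q4:.0f} tok/s, Q8 ceiling {q8:.0f} tok/s, ratio {q4 / q8:.2f}")
```

Both measured TG figures sit well below these ceilings, and the measured gap (~1.45x) is smaller than the ceiling ratio (~1.89x), consistent with latency and cache-miss costs that hit both quantizations roughly equally.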

Quality Difference in Practice

Q8_0 advantages:

  • More stable context retention (less topic drift in long conversations)
  • Fewer destructive edits (notably in aider code modifications)
  • An overall impression of greater reliability

The gap is not dramatic. Code scaffolding and summarization work fine at Q4_K_M. The difference surfaces in complex repository operations and long-form generation.

Why Q8_0 Is Viable on 768GB RAM

Q8_0’s 426GB is impractical on typical servers. On 768GB DDR5, the model + KV cache + page cache all fit in memory. “Use Q8_0 when memory is abundant” is a rational choice specific to this environment.
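A rough budget makes this concrete. The KV-cache dimensions below are hypothetical placeholders, not Llama-4-Maverick’s published geometry; the point is only that at 16K context the KV cache is small next to the weights:

```python
# Memory budget sketch for Q8_0 on a 768GB host.
# Layer/head dimensions are ILLUSTRATIVE ASSUMPTIONS, not the model's
# actual architecture; they only show the order of magnitude.

MODEL_GB = 426
CTX = 16_384
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES = 48, 8, 128, 2  # f16 K/V (assumed)

# 2x for K and V, per layer, per KV head, per head dim, per context slot.
kv_gb = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * CTX / 1e9
budget_gb = MODEL_GB + kv_gb
print(f"KV cache ~{kv_gb:.1f} GB, total ~{budget_gb:.0f} GB of 768 GB")
```

Even with generous dimensions the KV cache stays in the single-digit GB range, leaving ample headroom for page cache on a 768GB host.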

Lessons Learned

Q4_K_M vs Q8_0 selection is driven by task nature, not a simple speed-quality trade-off:

  • High-throughput oneshot generation, Dagster/dbt batch: Q4_K_M. 40% speed difference compounds over volume
  • Aider workflows, complex repo operations, long-form: Q8_0. Reduced destructive edit risk outweighs speed loss
  • Daily on/off operation: Q4_K_M. Faster startup
  • Always-on resident: Q8_0. Quality stability pays off over long sessions
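The compounding effect in the batch case is easy to quantify using this test’s measured mid-range speeds (the 1,000-job, 800-token batch is a hypothetical workload):

```python
# How the TG gap compounds over a batch of generations.
Q4_TG, Q8_TG = 22.5, 15.5          # tok/s, mid-range of measured values
JOBS, TOKENS_PER_JOB = 1_000, 800  # hypothetical batch workload

q4_hours = JOBS * TOKENS_PER_JOB / Q4_TG / 3600
q8_hours = JOBS * TOKENS_PER_JOB / Q8_TG / 3600
print(f"Q4_K_M ~{q4_hours:.1f} h, Q8_0 ~{q8_hours:.1f} h, "
      f"saved ~{q8_hours - q4_hours:.1f} h")
```

A per-request difference of a few tok/s becomes several hours per batch, which is why throughput-oriented pipelines favor Q4_K_M.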

MoE is what makes CPU inference practical at either quantization level: with 128 experts, only a small active subset is read per token, and that access locality keeps memory bandwidth utilization efficient enough to hold Q8_0 at a usable 15-16 tok/s.

Reproduction Steps

Q4_K_M

  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 \
  -m /models/Llama-4-Maverick-17B-128E-Instruct-Q4_K_M.gguf \
  --jinja -c 16384 \
  --threads 14 --threads-batch 14 \
  -b 1024 -ub 256 \
  --parallel 1 --flash-attn on
  

Q8_0

Same command, substitute Q8_0 model file.
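The PP/TG/TTFT numbers above can be collected from the server’s response timings. The following is a sketch assuming the server started above is listening on localhost:8081, using the timings fields llama.cpp’s /completion endpoint returns:

```python
import json
import urllib.request

def summarize(timings: dict) -> dict:
    """Reduce llama.cpp /completion 'timings' to PP/TG tok/s and TTFT."""
    return {
        "pp_tok_s": timings["prompt_per_second"],
        "tg_tok_s": timings["predicted_per_second"],
        "ttft_s": timings["prompt_ms"] / 1000,
    }

def bench(url: str, prompt: str, n_predict: int = 256) -> dict:
    """POST a completion request and summarize its reported timings."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return summarize(json.load(resp)["timings"])

# Example (needs a running server):
# print(bench("http://localhost:8081/completion", "Explain MoE routing."))
```

Run the same prompts against both containers and compare the summaries directly.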

Technical Notes

MoE mmap Behavior

Only a few of 128 experts are active per token. mmap page cache “warms” unevenly. Sustained single-topic processing hits the same experts repeatedly, improving cache hit rate. Frequent topic changes increase page faults.
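A toy simulation shows why topic churn touches more cold pages. This is not llama.cpp’s actual router; the top-2 routing and the 16-expert “single topic” subset are illustrative assumptions:

```python
import random

# Toy model of MoE page-cache warmth: per token, top-k of the expert pool
# fires. A sustained single topic skews routing toward a small subset;
# frequent topic changes approach uniform routing over all 128 experts.
# Unique experts touched ~ distinct weight regions faulted into page cache.

def experts_touched(n_tokens: int, pool: int, k: int = 2, seed: int = 0) -> int:
    rng = random.Random(seed)
    touched = set()
    for _ in range(n_tokens):
        touched.update(rng.sample(range(pool), k))  # top-k expert pick (toy)
    return len(touched)

focused = experts_touched(1_000, pool=16)    # sustained single topic
churning = experts_touched(1_000, pool=128)  # frequent topic changes
print(f"focused workload touches {focused} experts, churning {churning}")
```

The focused workload keeps re-hitting the same warm pages, while the churning one eventually faults in nearly the whole expert set, matching the observed behavior.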

Prompt Cache

Longest-common-prefix (LCP) similarity matching works on both Q4_K_M and Q8_0. With a fixed System Prompt, TTFT on subsequent requests drops significantly because the shared prefix is served from cache instead of being re-prefilled.
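A minimal sketch of the mechanism (token IDs below are illustrative):

```python
# The cache hit is governed by the longest common prefix (LCP) between
# the new request's token sequence and a cached entry's token sequence.

def lcp_len(a: list[int], b: list[int]) -> int:
    """Number of leading token IDs two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = list(range(700))          # ~700-token fixed System Prompt
cached = system + [900, 901, 902]  # tokens of a previous request
new = system + [950, 951]          # new request, same System Prompt
print(lcp_len(cached, new))        # the whole 700-token prefix is reused
```

At Q8_0’s measured ~52 tok/s prefill, reusing a 700-token prefix saves roughly 13 seconds of TTFT on each subsequent request.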