Background

GLM-4.7-Flash is a 30B-A3B MoE model from THUDM (Tsinghua University), built on the DeepSeek2 architecture with 64 experts (4 active per token). With 30B total parameters and only 3B active per token, it is lightweight yet capable, offering multilingual support and long context (up to 128K tokens).

On our EPYC 9175F + RTX PRO 6000 Blackwell setup, the main question was: how much performance does the “MoE Expert Offload to CPU” hybrid configuration actually deliver compared to CPU-only and Full GPU?

Objective

  1. Quantify Prefill/Decode speeds for GLM-4.7-Flash (IQ5_K) across CPU/Hybrid/Full GPU
  2. Validate the practicality of MoE Expert Offload (exps=CPU)
  3. Obtain comparison data with NVFP4 quantization on vLLM

Test Environment

| Item | Specification |
| --- | --- |
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB |
| Memory | DDR5-6400 768GB (12ch) |
| OS | Ubuntu 24.04 LTS |
| Runtime (CPU/Hybrid/GPU) | ik_llama.cpp (build 4192, commit 1cb7e1bf) |
| Runtime (NVFP4) | vLLM (OpenAI API compatible) |
| Model | GLM-4.7-Flash IQ5_K (GGUF, ubergarm quantization) |
| Context | 131,072 tokens (128K) |

Model Specifications

| Item | Value |
| --- | --- |
| Architecture | DeepSeek2 (MoE) |
| Layers | 47 |
| Experts | 64 (4 active) |
| Shared Experts | 1 |
| Attention | MLA (Multi-head Latent Attention) |
| Training Context | 202,752 |
| Vocabulary | 154,880 |

Methodology

Pattern A: CPU-Only

  ksh3@compute-server:~$ podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" --no-mmap --jinja \
  -c 131072 -n 8192 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ctk f16 -ctv f16
  

AVX-512 VNNI / BF16 active (AVX512 = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1). All layers on CPU.

Pattern B: Hybrid (Expert=CPU, Attention=GPU)

Same as Pattern A plus --device nvidia.com/gpu=all and -ot exps=CPU. MoE Expert weights on CPU RAM, Attention/KV cache on GPU.
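For reference, the full Pattern B invocation would look like this. A sketch assembled from the Pattern A command plus the two extra flags; `$MO`, `$IMG`, and `$MODEL` are the variables defined later in Reproduction Steps:

```shell
# Pattern B: Hybrid. Attention/KV cache on GPU, MoE Expert weights in CPU RAM
podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  --device nvidia.com/gpu=all \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" --no-mmap --jinja \
  -c 131072 -n 8192 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ctk f16 -ctv f16 \
  -ot exps=CPU
```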

Pattern C: Full GPU

All 48 layers (47 transformer layers plus the output layer, as counted by the runtime) offloaded to GPU. No expert offloading.

Results

3-Pattern Summary (128K Context, 30K+ Token Processing)

| Pattern | Setup | Max PP Speed | Avg TG Speed | Total Time | Notes |
| --- | --- | --- | --- | --- | --- |
| A | CPU-only | 100.32 t/s | 20.23 t/s | 879s | Pure CPU, slow for 128K |
| B | Hybrid (exps=CPU) | 1,635.35 t/s | 66.84 t/s | 169s | 16x PP boost over CPU |
| C | Full GPU | 3,723.34 t/s | 99.42 t/s | 80s | Near 100 t/s generation |

Pattern A: CPU-Only Detail

| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
| --- | --- | --- | --- | --- | --- |
| 1 | 31,151 | 427 | 100.32 | 21.51 | 330.4 |
| 2 | 980 | 6,284 | 45.55 | 19.85 | 338.1 |
| 3 | 2,886 | 2,921 | 48.53 | 19.34 | 210.5 |
| Total | 35,017 | 9,632 | 89.44 | 19.76 | 879.0 |

Pattern B: Hybrid Detail

| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
| --- | --- | --- | --- | --- | --- |
| 1 | 31,151 | 774 | 1,635.35 | 70.01 | 30.1 |
| 2 | 981 | 4,091 | 792.91 | 67.04 | 62.3 |
| 3 | 2,388 | 2,692 | 900.82 | 66.26 | 43.3 |
| 4 | 874 | 2,106 | 619.90 | 66.10 | 33.3 |
| Total | 35,394 | 9,663 | 1,453.76 | 66.84 | 168.9 |

16.3x PP improvement and 3.3x TG improvement over CPU-only.
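These speedups can be recomputed directly from the "Total" rows of the Pattern A and B tables as a quick sanity check:

```shell
# Hybrid (Pattern B) speedup over CPU-only (Pattern A), from the Total rows
awk 'BEGIN {
  pp_cpu = 89.44;   tg_cpu = 19.76   # Pattern A totals (t/s)
  pp_hyb = 1453.76; tg_hyb = 66.84   # Pattern B totals (t/s)
  printf "PP speedup: %.2fx\n", pp_hyb / pp_cpu
  printf "TG speedup: %.2fx\n", tg_hyb / tg_cpu
}'
# prints:
# PP speedup: 16.25x
# TG speedup: 3.38x
```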

Pattern C: Full GPU Detail

| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
| --- | --- | --- | --- | --- | --- |
| 1 | 31,151 | 630 | 3,723.34 | 106.67 | 14.3 |
| 2 | 981 | 4,325 | 1,638.04 | 99.16 | 44.2 |
| 3 | 2,373 | 1,918 | 1,619.97 | 97.84 | 21.1 |
| Total | 34,505 | 6,873 | 3,308.19 | 99.43 | 79.6 |

NVFP4 (vLLM) Reference

| Metric | Value | Notes |
| --- | --- | --- |
| Prefill | 80-250 t/s (peak 459 t/s) | Peak with prefix cache |
| Decode | 60-100 t/s (peak 112 t/s) | Stable range |
| TTFT (800-1100 token input) | 4-6 seconds | Reduced with prefix cache |
| Prefix cache hit rate | 20-40% | Rises with repeated agent calls |

Analysis

The Hybrid Sweet Spot

Pattern B was the standout finding. Offloading only the MoE Experts to CPU sounds like a compromise, but at 67 t/s, generation is comfortably fast enough for interactive use. Full GPU reaches 99 t/s, yet keeping the Experts in CPU RAM frees a large amount of VRAM, enabling longer contexts or running several models concurrently.

This is a viable strategy for GPUs under 96GB that still want MoE model benefits.

PP vs TG: Different Bottlenecks

  • PP (Prefill): Compute-bound. GPU parallelism scales it 37x over CPU
  • TG (Decode): Memory-bandwidth-bound. CPU-to-GPU improvement is “only” 5x

This asymmetry is inherent to transformer inference: Prefill processes the whole prompt in large batched matrix multiplications (compute-bound), while Decode produces one token at a time, so every step must stream the active weights and KV cache from memory (bandwidth-bound).
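A back-of-the-envelope roofline makes the decode bound concrete. Assuming roughly 5.5 bits/weight for IQ5_K and 3B active parameters, each decoded token streams about 2GB of weights; 12-channel DDR5-6400 peaks near 614GB/s theoretical. These figures are rough assumptions, not measurements, and real throughput sits well below the ceiling once sustained bandwidth, KV cache traffic, and per-token overheads are accounted for:

```shell
# Rough decode-speed ceiling from memory bandwidth (all figures approximate)
awk 'BEGIN {
  bw_gbs    = 12 * 6400 * 8 / 1000   # 12ch DDR5-6400, 8B per transfer: ~614 GB/s
  active_gb = 3e9 * 5.5 / 8 / 1e9    # 3B active params at ~5.5 bits/weight: ~2.06 GB
  printf "theoretical ceiling: %.0f t/s\n", bw_gbs / active_gb
}'
# prints: theoretical ceiling: 298 t/s
```

The measured 19.76 t/s in Pattern A is an order of magnitude below this bound, which is typical for CPU decode; the point of the calculation is that the ceiling scales with memory bandwidth, not FLOPS.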

CPU-Only at 20 t/s Could Be a Viable Option

Pattern A’s 20 t/s comfortably exceeds human reading speed (~6 t/s) and is sufficient for batch processing (e.g., Dagster pipelines). However, prefill on a 30K+ token prompt takes over five minutes, making it unsuitable for real-time long-context use.

Lessons Learned

The Hybrid configuration (-ot exps=CPU) performed far better than expected. Even with the majority of model weights on CPU, GPU-accelerated Attention alone yields a 3.3x TG improvement. This demonstrates the maturity of ik_llama.cpp’s Expert Offload feature.

Full GPU is the clear winner for pure speed, but for homelabs running multiple models on a single GPU, the Hybrid approach offers “67 t/s while saving most of the VRAM” as a compelling trade-off.

Reproduction Steps

1. Download Model

  huggingface-cli download ubergarm/GLM-4.7-Flash-GGUF \
  --include "GLM-4.7-Flash-IQ5_K.gguf" \
  --local-dir /mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF
  

2. Build ik_llama.cpp

ik_llama.cpp is a llama.cpp fork with native MLA support and Expert Offload. Build with Zen 5 optimization (-march=znver5 or -DGGML_NATIVE=ON).
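A typical build might look like the following. This is a sketch: the repository URL and CMake flags mirror upstream llama.cpp conventions and should be verified against the ik_llama.cpp README:

```shell
# Clone and build ik_llama.cpp with CUDA and native Zen 5 codegen
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON   # GGML_NATIVE picks up -march=znver5 on this host
cmake --build build --config Release -j "$(nproc)"
```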

3. Run (3 Patterns)

  # Common variables
  IMG=compute.home.arpa/ik_llama-cpu:latest
  MO=/mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF
  MODEL=/models/snapshots/.../GLM-4.7-Flash-IQ5_K.gguf

  # Pattern A: CPU-only
  podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
    -v "$MO":/models:ro,Z $IMG \
    -m "$MODEL" --no-mmap --jinja \
    -c 131072 -n 8192 --threads 13 --threads-batch 23 \
    -b 2048 -ub 2048 -ctk f16 -ctv f16 \
    --host 0.0.0.0 --port 8080

  # Pattern B: Hybrid - add --device nvidia.com/gpu=all -ot exps=CPU
  # Pattern C: Full GPU - add --device nvidia.com/gpu=all (no -ot flag)

Technical Notes

How Expert Offload Works

In MoE models, Expert weights dominate the total parameter count; in GLM-4.7-Flash, the 64 routed experts per layer account for the bulk of the 30B parameters. -ot exps=CPU places only Expert weights in CPU RAM while Attention, Embedding, and Router layers stay on GPU.

Post-selection Expert computation runs on CPU, but GPU-accelerated Attention (especially KV cache access) shifts the bottleneck, significantly improving Decode speed.
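The routing step described above can be sketched in miniature: the router scores all 64 experts for a token and only the top-4 are executed. The scores here are synthetic stand-ins for router logits, not real model output:

```shell
# Toy top-4-of-64 expert selection (synthetic scores, illustration only)
awk 'BEGIN {
  srand(7)
  n = 64; k = 4
  for (i = 0; i < n; i++) score[i] = rand()        # pretend router logits
  for (s = 0; s < k; s++) {                        # pick the k highest scores
    best = -1
    for (i = 0; i < n; i++)
      if (!(i in picked) && (best < 0 || score[i] > score[best])) best = i
    picked[best] = 1
    printf "token routed to expert %d\n", best
  }
}'
```

In the Hybrid configuration, only the matmuls for those 4 selected experts run on CPU per token; everything before and after the selection stays on GPU.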

ik_llama.cpp vs llama.cpp

ik_llama.cpp provides native MLA (Multi-head Latent Attention) support, optimized for DeepSeek2/GLM-4.7 architectures. Standard llama.cpp can load the GGUF but may lack MLA-specific optimizations.

For GPUs Under 96GB

With Expert Offload, VRAM consumption drops to roughly Attention layers + KV cache (~10-15GB for GLM-4.7-Flash IQ5_K). A 24GB+ GPU should deliver TG 60+ t/s in Hybrid mode.