Background

GLM-4.7-Flash is a 30B-A3B MoE model from THUDM (Tsinghua University), built on the DeepSeek2 architecture with 64 experts (4 active per token). With 30B total parameters and only 3B active per token, it is lightweight yet capable, offering multilingual support and long context (up to 128K tokens).

On our EPYC 9175F + RTX PRO 6000 Blackwell setup, the main question was: how much performance does the “MoE Expert Offload to CPU” hybrid configuration actually deliver compared to CPU-only and Full GPU?

Objective

  1. Quantify Prefill/Decode speeds for GLM-4.7-Flash (IQ5_K) across CPU/Hybrid/Full GPU
  2. Validate the practicality of MoE Expert Offload (exps=CPU)
  3. Obtain comparison data with NVFP4 quantization on vLLM

Test Environment

| Item | Specification |
| --- | --- |
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB |
| Memory | DDR5-6400 768GB (12ch) |
| OS | Ubuntu 24.04 LTS |
| Runtime (CPU/Hybrid/GPU) | ik_llama.cpp (build 4192, commit 1cb7e1bf) |
| Runtime (NVFP4) | vLLM (OpenAI API compatible) |
| Model | GLM-4.7-Flash IQ5_K (GGUF, ubergarm quantization) |
| Context | 131,072 tokens (128K) |

Model Specifications

| Item | Value |
| --- | --- |
| Architecture | DeepSeek2 (MoE) |
| Layers | 47 |
| Experts | 64 (4 active) |
| Shared Experts | 1 |
| Attention | MLA (Multi-head Latent Attention) |
| Training Context | 202,752 |
| Vocabulary | 154,880 |

Methodology

Pattern A: CPU-Only

  ksh3@compute-server:~$ podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" --no-mmap --jinja \
  -c 131072 -n 8192 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ctk f16 -ctv f16
  

AVX-512 VNNI / BF16 active (AVX512 = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1). All layers on CPU.

Pattern B: Hybrid (Expert=CPU, Attention=GPU)

Same as Pattern A plus --device nvidia.com/gpu=all and -ot exps=CPU. MoE Expert weights on CPU RAM, Attention/KV cache on GPU.
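For reference, the full Pattern B invocation would look like this. A sketch assembled from the Pattern A command plus the two extra flags; `$MO`, `$IMG`, and `$MODEL` are the variables defined later in Reproduction Steps:

```shell
# Pattern B: Hybrid. Attention/KV cache on GPU, MoE Expert weights in CPU RAM
podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  --device nvidia.com/gpu=all \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" --no-mmap --jinja \
  -c 131072 -n 8192 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ctk f16 -ctv f16 \
  -ot exps=CPU
```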

Pattern C: Full GPU

All 48 layers (47 transformer layers plus the output layer, as counted by the runtime) offloaded to GPU. No expert offloading.

Results

3-Pattern Summary (128K Context, 30K+ Token Processing)

| Pattern | Setup | Max PP Speed | Avg TG Speed | Total Time | Notes |
| --- | --- | --- | --- | --- | --- |
| A | CPU-only | 100.32 t/s | 20.23 t/s | 879s | Pure CPU, slow for 128K |
| B | Hybrid (exps=CPU) | 1,635.35 t/s | 66.84 t/s | 169s | 16x PP boost over CPU |
| C | Full GPU | 3,723.34 t/s | 99.42 t/s | 80s | Near 100 t/s generation |

Pattern A: CPU-Only Detail

| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
| --- | --- | --- | --- | --- | --- |
| 1 | 31,151 | 427 | 100.32 | 21.51 | 330.4 |
| 2 | 980 | 6,284 | 45.55 | 19.85 | 338.1 |
| 3 | 2,886 | 2,921 | 48.53 | 19.34 | 210.5 |
| Total | 35,017 | 9,632 | 89.44 | 19.76 | 879.0 |

Pattern B: Hybrid Detail

| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
| --- | --- | --- | --- | --- | --- |
| 1 | 31,151 | 774 | 1,635.35 | 70.01 | 30.1 |
| 2 | 981 | 4,091 | 792.91 | 67.04 | 62.3 |
| 3 | 2,388 | 2,692 | 900.82 | 66.26 | 43.3 |
| 4 | 874 | 2,106 | 619.90 | 66.10 | 33.3 |
| Total | 35,394 | 9,663 | 1,453.76 | 66.84 | 168.9 |

16.3x PP improvement and 3.3x TG improvement over CPU-only.
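These speedups can be recomputed directly from the "Total" rows of the Pattern A and B tables as a quick sanity check:

```shell
# Hybrid (Pattern B) speedup over CPU-only (Pattern A), from the Total rows
awk 'BEGIN {
  pp_cpu = 89.44;   tg_cpu = 19.76   # Pattern A totals (t/s)
  pp_hyb = 1453.76; tg_hyb = 66.84   # Pattern B totals (t/s)
  printf "PP speedup: %.2fx\n", pp_hyb / pp_cpu
  printf "TG speedup: %.2fx\n", tg_hyb / tg_cpu
}'
# prints:
# PP speedup: 16.25x
# TG speedup: 3.38x
```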

Pattern C: Full GPU Detail

| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
| --- | --- | --- | --- | --- | --- |
| 1 | 31,151 | 630 | 3,723.34 | 106.67 | 14.3 |
| 2 | 981 | 4,325 | 1,638.04 | 99.16 | 44.2 |
| 3 | 2,373 | 1,918 | 1,619.97 | 97.84 | 21.1 |
| Total | 34,505 | 6,873 | 3,308.19 | 99.43 | 79.6 |

NVFP4 (vLLM) Reference

| Metric | Value | Notes |
| --- | --- | --- |
| Prefill | 80-250 t/s (peak 459 t/s) | Peak with prefix cache |
| Decode | 60-100 t/s (peak 112 t/s) | Stable range |
| TTFT (800-1100 token input) | 4-6 seconds | Reduced with prefix cache |
| Prefix cache hit rate | 20-40% | Rises with repeated agent calls |

Analysis

The Hybrid Sweet Spot

Pattern B was the standout finding. Offloading only the MoE Experts to CPU sounds like a compromise, but at 67 t/s, generation is comfortably fast enough for interactive use. Full GPU reaches 99 t/s, yet keeping the Experts in CPU RAM frees a large amount of VRAM, enabling longer contexts or running several models concurrently.

This is a viable strategy for GPUs under 96GB that still want MoE model benefits.

PP vs TG: Different Bottlenecks

  • PP (Prefill): Compute-bound. GPU parallelism scales it 37x over CPU
  • TG (Decode): Memory-bandwidth-bound. CPU-to-GPU improvement is “only” 5x

This asymmetry is inherent to transformer inference: Prefill processes the whole prompt in large batched matrix multiplications (compute-bound), while Decode produces one token at a time, so every step must stream the active weights and KV cache from memory (bandwidth-bound).
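A back-of-the-envelope roofline makes the decode bound concrete. Assuming roughly 5.5 bits/weight for IQ5_K and 3B active parameters, each decoded token streams about 2GB of weights; 12-channel DDR5-6400 peaks near 614GB/s theoretical. These figures are rough assumptions, not measurements, and real throughput sits well below the ceiling once sustained bandwidth, KV cache traffic, and per-token overheads are accounted for:

```shell
# Rough decode-speed ceiling from memory bandwidth (all figures approximate)
awk 'BEGIN {
  bw_gbs    = 12 * 6400 * 8 / 1000   # 12ch DDR5-6400, 8B per transfer: ~614 GB/s
  active_gb = 3e9 * 5.5 / 8 / 1e9    # 3B active params at ~5.5 bits/weight: ~2.06 GB
  printf "theoretical ceiling: %.0f t/s\n", bw_gbs / active_gb
}'
# prints: theoretical ceiling: 298 t/s
```

The measured 19.76 t/s in Pattern A is an order of magnitude below this bound, which is typical for CPU decode; the point of the calculation is that the ceiling scales with memory bandwidth, not FLOPS.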

CPU-Only at 20 t/s Could Be a Viable Option

Pattern A’s 20 t/s comfortably exceeds human reading speed (~6 t/s) and is sufficient for batch processing (e.g., Dagster pipelines). However, prefill on a 30K+ token prompt takes over five minutes, making it unsuitable for real-time long-context use.

Lessons Learned

The Hybrid configuration (-ot exps=CPU) performed far better than expected. Even with the majority of model weights on CPU, GPU-accelerated Attention alone yields a 3.3x TG improvement. This demonstrates the maturity of ik_llama.cpp’s Expert Offload feature.

Full GPU is the clear winner for pure speed, but for homelabs running multiple models on a single GPU, the Hybrid approach offers “67 t/s while saving most of the VRAM” as a compelling trade-off.

Reproduction Steps

1. Download Model

  huggingface-cli download ubergarm/GLM-4.7-Flash-GGUF \
  --include "GLM-4.7-Flash-IQ5_K.gguf" \
  --local-dir /mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF
  

2. Build ik_llama.cpp

ik_llama.cpp is a llama.cpp fork with native MLA support and Expert Offload. Build with Zen 5 optimization (-march=znver5 or -DGGML_NATIVE=ON).
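A typical build might look like the following. This is a sketch: the repository URL and CMake flags mirror upstream llama.cpp conventions and should be verified against the ik_llama.cpp README:

```shell
# Clone and build ik_llama.cpp with CUDA and native Zen 5 codegen
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON   # GGML_NATIVE picks up -march=znver5 on this host
cmake --build build --config Release -j "$(nproc)"
```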

3. Run (3 Patterns)

  # Common variables
  IMG=compute.home.arpa/ik_llama-cpu:latest
  MO=/mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF
  MODEL=/models/snapshots/.../GLM-4.7-Flash-IQ5_K.gguf

  # Pattern A: CPU-only
  podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
    -v "$MO":/models:ro,Z $IMG \
    -m "$MODEL" --no-mmap --jinja \
    -c 131072 -n 8192 --threads 13 --threads-batch 23 \
    -b 2048 -ub 2048 -ctk f16 -ctv f16 \
    --host 0.0.0.0 --port 8080

  # Pattern B: Hybrid - add --device nvidia.com/gpu=all -ot exps=CPU
  # Pattern C: Full GPU - add --device nvidia.com/gpu=all (no -ot flag)

Technical Notes

How Expert Offload Works

In MoE models, Expert weights dominate the total parameter count; in GLM-4.7-Flash, the 64 routed experts per layer account for the bulk of the 30B parameters. -ot exps=CPU places only Expert weights in CPU RAM while Attention, Embedding, and Router layers stay on GPU.

Post-selection Expert computation runs on CPU, but GPU-accelerated Attention (especially KV cache access) shifts the bottleneck, significantly improving Decode speed.
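The routing step described above can be sketched in miniature: the router scores all 64 experts for a token and only the top-4 are executed. The scores here are synthetic stand-ins for router logits, not real model output:

```shell
# Toy top-4-of-64 expert selection (synthetic scores, illustration only)
awk 'BEGIN {
  srand(7)
  n = 64; k = 4
  for (i = 0; i < n; i++) score[i] = rand()        # pretend router logits
  for (s = 0; s < k; s++) {                        # pick the k highest scores
    best = -1
    for (i = 0; i < n; i++)
      if (!(i in picked) && (best < 0 || score[i] > score[best])) best = i
    picked[best] = 1
    printf "token routed to expert %d\n", best
  }
}'
```

In the Hybrid configuration, only the matmuls for those 4 selected experts run on CPU per token; everything before and after the selection stays on GPU.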

ik_llama.cpp vs llama.cpp

ik_llama.cpp provides native MLA (Multi-head Latent Attention) support, optimized for DeepSeek2/GLM-4.7 architectures. Standard llama.cpp can load the GGUF but may lack MLA-specific optimizations.

For GPUs Under 96GB

With Expert Offload, VRAM consumption drops to roughly Attention layers + KV cache (~10-15GB for GLM-4.7-Flash IQ5_K). A 24GB+ GPU should deliver TG 60+ t/s in Hybrid mode.