Background

Kimi-K2.5 is a 1.03-trillion-parameter MoE model from Moonshot AI. Each token activates 8 of its 384 experts, keeping active parameters at ~32B. Built on the deepseek2 architecture with MLA (Multi-head Latent Attention), it compresses the KV cache for better memory efficiency.

At Q4_K_S quantization, RSS is ~523 GiB. At Q4_K_M, ~579 GiB. Both fit within 768GB DDR5 memory, making GPU-free CPU inference physically possible. The question is whether “physically possible” translates to “practically usable.”
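As a quick sanity check, the headroom left for the KV cache, OS, and other processes follows directly from the figures above (a trivial sketch using the RSS values reported in this post):

```python
# Rough memory-headroom check using the RSS figures reported above.
model_rss_gib = {"Q4_K_S": 523, "Q4_K_M": 579}
total_gib = 768  # installed DDR5

for name, rss in model_rss_gib.items():
    headroom = total_gib - rss
    print(f"{name}: {headroom} GiB left for KV cache, OS, and pipelines")
```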

Objective

  1. Benchmark Q4_K_S CPU inference speed (baseline performance)
  2. Measure thread count vs throughput relationship, identify optimal thread count
  3. Measure Q4_K_M Prefill/Decode speed at 32K context
  4. Validate Prompt Cache (LCP similarity) effectiveness
  5. Determine viability for Dagster pipeline batch operations

Test Environment

| Item | Specification |
| --- | --- |
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory | DDR5-6400 768GB (12ch) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB |
| OS | Ubuntu 24.04 LTS |
| Runtime | llama.cpp server (Podman rootless) |

Model Specifications

| Item | Q4_K_S | Q4_K_M |
| --- | --- | --- |
| Architecture | deepseek2 (MoE + MLA) | Same |
| Total Parameters | 1.03T | Same |
| Layers | 61 | Same |
| Experts | 384 (8 active) | Same |
| Quantization | Q4_K_S | Q4_K_M (4.84 bpw) |
| Model Size | ~520 GiB (RSS) | 578.57 GiB |
| Training Context | 262,144 | Same |

Methodology

Benchmark Command (llama-sweep-bench)

  MODEL=/models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf
  IMG=compute.home.arpa/ik_llama-cuda:latest
  podman run --rm -it \
    --device nvidia.com/gpu=all \
    --shm-size 16g \
    --cap-add=SYS_NICE \
    -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
    $IMG \
    /app/llama-sweep-bench \
      --model "$MODEL" \
      --no-mmap --merge-qkv \
      -mla 3 -amb 512 \
      -b 4096 -ub 4096 \
      -ctk f16 -ctv f16 \
      -c 131072 \
      -ngl 999 -ot exps=CPU \
      --threads 13 \
      --threads-batch 26 \
      --warmup-batch \
      -n 128

This command performs:

  • llama-sweep-bench: Dedicated benchmarking tool
  • -ngl 999 -ot exps=CPU: Offload all layers to GPU, except Expert weights, which stay on CPU
  • -c 131072: 131K context support
  • -ctk f16 -ctv f16: KV cache in f16 precision
  • --threads 13 --threads-batch 26: Thread configuration
  • -mla 3 -amb 512: MLA (Multi-head Latent Attention) parameters

Memory Layout (Q4_K_S)

| Region | Size |
| --- | --- |
| KV cache (K) | 1,098 MiB |
| KV cache (V) | 976 MiB |
| CPU compute buffer | 348 MiB |
| Total RSS | ~523 GiB / 755 GiB |
| Swap usage | 799 MiB (no si/so activity) |

Memory Layout (Q4_K_M / ctx=32K)

| Region | Size |
| --- | --- |
| KV cache | 4,148 MiB (K: 2,196 / V: 1,952) |
| CPU compute buffer | 348 MiB |
| Model buffers | 578.57 GiB (13-split GGUF) |

CPU Inference Live Demonstration

To verify that CPU inference actually works on Kimi-K2.5, we recorded real-time execution footage on the EPYC 9175F. Watching tokens stream through the inference server’s stdout as they are computed provides tangible proof that this large model runs on CPU alone, and lets you judge the actual output quality directly.

The video demonstrates:

  • llama.cpp server startup and model loading (quantized weights)
  • Prefill phase (prompt evaluation) token generation speed and content
  • Token-by-token Generate phase output
  • Verification of actual generated text quality

Results

Q4_K_S Baseline (th=14, ctx=16K)

| Request | Prompt(tok) | PP(tok/s) | Gen(tok) | TG(tok/s) | Total(s) |
| --- | --- | --- | --- | --- | --- |
| 1st (no cache) | 823 | 22.24 | 438 | 10.27 | 79.7 |
| 2nd (cache saved) | 1,335 | 19.98 | 1,012 | 8.76 | 115.6 |
| 3rd (LCP hit) | - | - | - | - | cache lookup 62ms |

Thread Optimization (ctx=8K)

| Threads | PP(tok/s) | TG(tok/s) | Assessment |
| --- | --- | --- | --- |
| 16 | 24.43 | 12.94 | Maximum output (baseline) |
| 14 | 21.32 | 12.50 | Bandwidth saturation onset |
| 13 | 21.58 | 11.67 | Sweet spot |
| 12 | 14.58 | 11.86 | Resource efficiency focus |

Q4_K_M Long Context (th=13, ctx=32K)

| Request | Prompt(tok) | PP(tok/s) | Gen(tok) | TG(tok/s) | Notes |
| --- | --- | --- | --- | --- | --- |
| 1st | 16,148 | 6.15 | 333 | 2.44 | Full 16K prefill, ~44 min |
| 2nd (LCP 0.978) | 356 | 3.40 | 2,048 | 2.26 | Cache hit, diff-only prefill |
| 3rd (LCP 0.999) | 12 | 3.11 | 1,024 | 2.15 | Near-full cache restore |
| 4th (LCP 0.939) | 1,050 | 3.21 | 1,024 | 2.07 | Partial cache + diff prefill |

Prompt Cache Effectiveness (Q4_K_S)

| State | Size | Effect |
| --- | --- | --- |
| 1,260 tokens saved | 159.5 MiB | LCP similarity > 0.5 triggers hit |
| Cache restore | - | Tens of ms (62ms measured) |
| TTFT reduction | - | Dramatic prompt eval time reduction on repeat |
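The LCP check itself is simple. The following is a minimal sketch of the idea (a hypothetical helper, not llama.cpp’s actual implementation): the cached token sequence and the incoming prompt share a longest common prefix, and the ratio of that prefix to the incoming prompt length serves as the similarity score.

```python
def lcp_similarity(cached: list[int], incoming: list[int]) -> float:
    """Ratio of the longest common token prefix to the incoming prompt length."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n / max(len(incoming), 1)

# A prompt sharing its first 990 tokens with the cache scores well above the
# 0.5 threshold, so only the small trailing diff needs a fresh prefill.
cached = list(range(1000))
incoming = list(range(990)) + [-1] * 34
assert lcp_similarity(cached, incoming) > 0.5
```

Anything scoring above the 0.5 threshold reuses the saved KV state and prefills only the diff, which is why the 2nd–4th requests in the 32K test processed only a few hundred tokens instead of 16K.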

Addendum: Improvements with ik_llama.cpp (Expert CPU + Attention GPU Hybrid)

Using the optimized ik_llama.cpp build with Expert weights on CPU and Attention layers on GPU (-ngl 999 -ot exps=CPU), we collected the following performance data:

Execution Command

  podman run --rm -it --device nvidia.com/gpu=all \
    -p 8081:8080 \
    --shm-size 32g \
    --cap-add=SYS_NICE \
    -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
    $IMG \
    --host 0.0.0.0 --port 8080 \
    -m "$MODEL" --no-mmap --jinja \
    -c 131072 \
    -n 128 \
    --threads 13 --threads-batch 26 \
    -b 2048 -ub 512 \
    -ngl 999 -ot exps=CPU \
    -ctk f16 -ctv f16 \
    --merge-qkv -mla 3 -amb 512

-ngl 999 -ot exps=CPU: Offload Attention layers to GPU, Expert weights remain on CPU (Hybrid configuration)

Benchmark Results (Initial)

| Task | PP(tok) | TG(tok) | N_KV(tok) | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5,264 | 744 | 6,007 | 59.596 | 88.33 | 37.815 | 19.67 |
| 747 | 765 | 259 | 6,287 | 13.277 | 57.62 | 13.164 | 19.68 |
| 1,007 | 279 | 1,024 | 7,331 | 6.243 | 44.69 | 52.452 | 19.52 |
| 2,032 | 1,037 | 1,024 | 8,368 | 16.772 | 61.83 | 51.793 | 19.77 |
| 3,057 | 1,041 | 310 | 8,695 | 16.637 | 62.57 | 16.124 | 19.23 |
| Avg | - | - | - | - | 63.0 | - | 19.6 |

Subsequent Measurements

After applying Prompt Cache (LCP) and template optimizations:

| Run | PP(tok) | TG(tok) | N_KV(tok) | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 5,330 | 401 | 5,730 | 41.298 | 129.06 | 20.458 | 19.60 | Fresh request |
| 2 | 416 | 2,241 | 7,986 | 8.363 | 49.75 | 114.552 | 19.56 | Cache partial miss |
| 3 | 2,255 | 919 | 8,919 | 20.631 | 109.30 | 48.056 | 19.12 | Cache partial miss |

Metric Explanations

| Metric | Meaning | Example |
| --- | --- | --- |
| PP (Prompt eval tokens) | Tokens evaluated during prefill phase | Prompt length + cache diff |
| TG (eval tokens) | Tokens evaluated during generation phase | Output token count |
| N_KV | Total KV cache tokens at decode completion | Accumulated cache depth |
| T_PP(s) | Prefill duration (seconds) | Input processing time |
| S_PP(t/s) | Prefill speed (tokens/sec) | PP / T_PP |
| T_TG(s) | Generation duration (seconds) | Token-by-token generation time |
| S_TG(t/s) | Generation speed (tokens/sec) | TG / T_TG |
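Concretely, the speed columns are just token counts divided by phase durations. Recomputing the first row of the initial benchmark from its raw values:

```python
def speeds(pp_tokens: int, t_pp_s: float, tg_tokens: int, t_tg_s: float):
    """S_PP and S_TG as defined in the table above: tokens / seconds."""
    return pp_tokens / t_pp_s, tg_tokens / t_tg_s

# Raw values from Task 0 of the initial benchmark.
s_pp, s_tg = speeds(5_264, 59.596, 744, 37.815)
print(f"S_PP={s_pp:.2f} t/s, S_TG={s_tg:.2f} t/s")  # matches the 88.33 / 19.67 row
```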

Assessment (Improvements / Limitations)

Improvements:

  • Prefill throughput greatly improved: Runs 1 and 3 reach 100–130 t/s, so long prompts process quickly
  • Generation speed stabilized: S_TG hovers around 19 t/s across all runs; Blackwell + Q4_K_S hits a hard ceiling here
  • Cache effectiveness: Partial cache hits in Runs 2–3 keep S_PP at 50–110 t/s

Current constraints:

  • Generation bottleneck: S_TG fixed at ~19 t/s. Perceived latency dominated by prefill time and output token count
  • Cache consistency: Runs 2–3 show “Common part does not match fully” warnings. Even minor changes to System Prompt or templates (whitespace, timestamps) fragment the cache

Next Steps (Priority Order)

1. Maximize cache efficiency ⭐ Highest impact

  • Completely fix System Prompt, tool declarations, and templates
  • Remove dynamic strings (timestamps, random IDs, session markers)
  • Systemize OpenWebUI dynamic injection to unify prompt structure
  • Effect: Runs 2–3 prefill speed approaches ideal (100+ t/s)

2. Ensure parameter alignment

  • Match server startup -n (max generation tokens) with OpenWebUI max_tokens
  • Eliminate mismatches like params.n_predict=2048 slot.n_predict=128 observed in earlier logs
  • Remove wasted buffering and computation

3. Structural generation improvements (requires alternative approach)

  • Kimi-K2.5 Q4_K_S architecture makes S_TG improvement difficult
  • Next candidates:
    • Quantization change: Q4_K_S → IQ4_XS/IQ3_M (ik_llama.cpp recommended)
    • Model size: Switch to lighter MoE model
    • Architecture: Select model with MoE activation patterns optimized for GPU

Build Verification (SM_120 Support)

To confirm Blackwell (SM 120) build is in use:

  1. Quick check: If “compiled for: 520” no longer appears in the server startup logs, the new build is active
  2. Thorough check: Review the build log (cmake configure phase) for:
      CMAKE_CUDA_ARCHITECTURES=120

Analysis

Memory Bandwidth and Why th=13

Decode speed saturates around th=13-14. Twelve-channel DDR5-6400 has a theoretical bandwidth of ~614 GB/s, but the MoE’s random expert-access pattern cannot fully utilize it. th=16 gives 12.94 tok/s and th=13 gives 11.67 tok/s: dropping 3 threads costs only ~10% of speed. The rationale for th=13 is that the 3 freed cores can host Dagster/Trino and other data-pipeline processes, retaining ~90% of inference speed while enabling coexistence. On EPYCs with more cores, speed has been reported to keep scaling with thread count; treat this ratio as a good balance for our 16-core setup.
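A rough model of why decode saturates (back-of-envelope only; it assumes every decoded token must stream all ~32B active parameters from DRAM at ~4.84 bits each, ignoring KV-cache traffic and any caching effects):

```python
# Theoretical DRAM bandwidth: 12 channels x DDR5-6400 (MT/s) x 8 bytes/transfer.
bandwidth_gbs = 12 * 6400 * 8 / 1000       # ~614 GB/s
active_params = 32e9                        # active parameters per token (MoE)
bytes_per_tok = active_params * 4.84 / 8    # ~4.84 bpw quantized weights

ceiling = bandwidth_gbs * 1e9 / bytes_per_tok
print(f"bandwidth-bound decode ceiling ~ {ceiling:.0f} tok/s")
# Measured ~12 tok/s -> roughly 40% of the ideal streaming bound,
# consistent with random expert access not saturating the channels.
```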

Long Context Reality Check

16K bulk prefill taking 44 minutes suggests that “fill 256K from scratch every time” is unrealistic. At 20 tok/s, 256K would take ~3.5 hours.
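The arithmetic behind that estimate (using the ~20 tok/s prefill figure assumed above):

```python
ctx = 262_144    # full 256K training context
pp_rate = 20.0   # assumed prefill speed, tok/s
print(f"{ctx / pp_rate / 3600:.1f} hours")  # prints "3.6 hours"
```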

The practical solution:

  • Limit ctx to 16K-32K
  • Prefill a System Digest (8K-16K) once at startup
  • Use Prompt Cache (LCP similarity) for diff-only processing on subsequent requests
  • Keep output length at 1K default, 2K only when necessary

Decode 2.4 tok/s vs 10 tok/s

Q4_K_S at ctx=16K delivers 10 tok/s, while Q4_K_M at ctx=32K delivers 2.4 tok/s. The gap is primarily context-length driven: at 32K, attention over the 4.1 GiB KV cache becomes the bottleneck (the larger Q4_K_M weights contribute a smaller share). Unsuitable for interactive chat, but batch processing tolerates the wait.

Lessons Learned

Running a 1T model on CPU is technically validated. Decode at 10 tok/s is insufficient for interactive use but fully practical for Dagster pipeline batch generation, dataset augmentation, and distillation teacher generation.

Operational conclusion: run interactive/fast inference on the GPU side (RTX PRO 6000, vLLM), while CPU llama.cpp runs at th=13 as a resident batch intelligence engine. Of 768GB memory, 523 GiB goes to the model, leaving 200GB+ for DataFrame operations and Trino queries in parallel.

Reproduction Steps

1. Download Models

  # Q4_K_S
  huggingface-cli download unsloth/Kimi-K2.5-GGUF \
    --include "Q4_K_S/*" \
    --local-dir /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF

  # Q4_K_M
  huggingface-cli download unsloth/Kimi-K2.5-GGUF \
    --include "Q4_K_M/*" \
    --local-dir /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF

2. Run

See commands in “Methodology” section. Requires llama.cpp with flash-attn and prompt cache support.

3. Measure

  curl -s http://localhost:8081/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"kimi","messages":[{"role":"user","content":"Explain MoE architecture"}],"max_tokens":512}'

Extract “prompt eval time” and “eval time” from the server logs.
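A small parser for those log lines can automate this step (a sketch; it assumes llama.cpp’s usual timing format, e.g. `prompt eval time = 37005.21 ms / 823 tokens (...)`, which may vary across versions):

```python
import re

# Matches llama.cpp-style timing lines: "<phase> time = <ms> ms / <n> tokens|runs".
TIMING = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms /\s*(\d+) (?:tokens|runs)")

def parse_timings(log_text: str) -> dict:
    """Map phase name -> (milliseconds, token count) from server log text."""
    return {m.group(1): (float(m.group(2)), int(m.group(3)))
            for m in TIMING.finditer(log_text)}

log = """
prompt eval time =   37005.21 ms /   823 tokens
       eval time =   42648.00 ms /   438 runs
"""
t = parse_timings(log)
pp = t["prompt eval"][1] / (t["prompt eval"][0] / 1000)  # tok/s, ~22.2 as in the baseline table
```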

Technical Notes

Prompt Cache Design Principles

  • System Digest must be byte-identical across requests (whitespace/date changes break cache)
  • RAG context goes in user messages, not system (cache preservation is priority)

Q4_K_S vs Q4_K_M

Q4_K_M is ~60GB larger (520→579 GiB). Quality is marginally better, but speed difference is minimal. For ctx=16K batch operations, Q4_K_S is sufficient.

| Parameter | Recommended | Rationale |
| --- | --- | --- |
| ctx | 16,384-32,768 | 256K is impractical |
| threads | 13 | Memory bandwidth saturation; frees 3 cores for pipeline |
| ubatch | 256 | More stable than 512 |
| cache-ram | 32,768 MiB | Stabilizes LCP hit rate |
| output | 1,024 | Generation speed is the bottleneck; keep output short |
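Applied to a server launch, the recommendations above amount to a flag set like the following (a hypothetical helper; the flag names follow the llama.cpp server conventions used earlier in this post):

```python
# Recommended parameters from the table above, mapped to server flags.
recommended = {
    "-c": 32_768,           # ctx: 256K is impractical on CPU
    "--threads": 13,        # leave 3 cores for Dagster/Trino
    "-ub": 256,             # ubatch: more stable than 512
    "--cache-ram": 32_768,  # MiB; stabilizes LCP hit rate
    "-n": 1_024,            # keep output short; decode is the bottleneck
}
flags = " ".join(f"{k} {v}" for k, v in recommended.items())
print(flags)
```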