Background

After establishing stable 10-13 tok/s decode speeds with Kimi-K2.5 (1T MoE) on EPYC 9175F for batch processing, the next step was evaluating DeepSeek-V3.2 as a second-opinion model lane.

Run in the same llama.cpp environment, it showed fast prefill (PP) at 50-100 tok/s, but decode (TG) stuck at 14-15 tok/s and would not go higher. DeepSeek-V3.2 appeared “slower” than Kimi-K2.5, but the root cause needed isolation: was it model capability or operational configuration?

Objective

  1. Identify why DeepSeek-V3.2 decode speed stalls at 14-15 tok/s
  2. Quantify the impact of cache control and optimization flags from llama.cpp logs
  3. Prioritize improvement actions

Test Environment

  • CPU: AMD EPYC 9175F (Zen 5, 16C)
  • Memory: DDR5-6400 768GB (12ch)
  • OS: Ubuntu 24.04 LTS
  • Runtime: llama.cpp (server mode, fused_moe=1)
  • Model: DeepSeek-V3.2 Speciale (MoE)
  • KV Cache: f16 (default)

Results

Measured Inference Throughput

Values extracted from logs of 4 consecutive tasks:

| Task ID | PP (tok) | TG (tok) | Cumulative Tokens | PP Speed (tok/s) | TG Speed (tok/s) | PP Time (s) | TG Time (s) |
|---------|----------|----------|-------------------|------------------|------------------|-------------|-------------|
| 0       | 2,731    | 1,024    | 3,755             | 99.75            | 14.57            | 27.4        | 70.3        |
| 1026    | 857      | 1,024    | 5,636             | 74.47            | 15.21            | 11.5        | 67.3        |
| 2051    | 311      | 982      | 6,929             | 52.92            | 14.51            | 5.9         | 67.7        |
| 3034    | 4,865    | 1,024    | 12,818            | 100.22           | 14.22            | 48.5        | 72.0        |

PP varies between 50-100 tok/s depending on input size. TG concentrates in a tight 14.22-15.21 tok/s band, independent of input size or cumulative token count—a clear “wall.”

Cache Mismatch in Logs

  Common part does not match fully → kv cache rm [p0, end)
  

This appeared on every task. The leading token sequence differed between requests, causing KV cache to be discarded each time.
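One quick way to quantify this is to count the invalidation messages in the server log. A minimal sketch, assuming the log was captured to a file and that your llama.cpp build emits the message quoted above (adjust the pattern if it words it differently):

```shell
# count_cache_discards LOGFILE: number of KV cache invalidation events.
# The grep pattern matches the "kv cache rm" message quoted above.
count_cache_discards() {
  grep -c "kv cache rm" "$1"
}
```

If the count equals the request count, as it did here, no request is benefiting from the prompt cache.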

Speculative Decoding Status

  no implementations specified for speculative decoding
  

No draft model configured; Speculative Decoding was inactive.
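Enabling it means pointing the server at a small draft model that shares the target model's vocabulary. As a hedged sketch, the flag names below exist in recent llama.cpp builds, but the draft model path is purely illustrative; verify against llama-server --help for your build:

```shell
# Illustrative additions to the server command line (flag names from recent
# llama.cpp builds; the draft model path is hypothetical):
--model-draft /models/draft.gguf \
--draft-max 16 --draft-min 1
```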

Analysis

What the 14-15 tok/s Wall Means

The near-constant decode speed across tasks points to a physical memory bandwidth limit. KV cache is stored at f16 precision, and as context grows, attention computation becomes memory-bandwidth-dominated.
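A back-of-envelope check supports this. Assuming roughly 37B active parameters per token for the DeepSeek-V3 family and about 1 byte per weight, both assumptions since the GGUF quantization is not stated in the logs, 12-channel DDR5-6400 puts a hard ceiling close to the observed band:

```shell
# decode_ceiling: bandwidth-bound decode limit in tok/s.
# Assumptions (not from the logs): ~37e9 active params per token,
# ~1 byte per weight (8-bit quant), and theoretical peak bandwidth
# of 12 channels x 6400 MT/s x 8 bytes per transfer (~614 GB/s).
decode_ceiling() {
  awk 'BEGIN {
    bw = 12 * 6400 * 8 * 1e6        # bytes/s, ~614 GB/s theoretical
    bytes_per_token = 37e9 * 1.0    # active weights read per decoded token
    printf "%.1f\n", bw / bytes_per_token
  }'
}
decode_ceiling
```

That theoretical ceiling of roughly 16.6 tok/s, minus KV cache traffic and the gap between peak and sustained bandwidth, lands right on the 14-15 tok/s wall.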

Kimi-K2.5 was run with q8_0 KV cache quantization to reduce bandwidth pressure. Applying the same setting to DeepSeek-V3.2 should improve TG speed.

Impact of Prompt Cache Mismatches

With Kimi-K2.5, a fixed prefix (System Prompt + Knowledge Digest) maintained high LCP cache hit rates. The DeepSeek-V3.2 test had three issues:

  1. <think> tag inconsistency: Thinking Prompt presence varied per request, breaking leading token alignment
  2. System Prompt variance: Templates were not locked down
  3. Context management difference: Kimi-K2.5 reused prior context; DeepSeek rebuilt context from scratch each time

The conclusion: “DeepSeek is slower” was largely “DeepSeek was tested without cache benefits.”

MoE Optimization Gap

While fused_moe=1 is active in logs, llama.cpp’s MoE implementation is more generic compared to specialized kernels in vLLM or cloud services. Expert routing implementation differences likely contribute to the speed gap.

Lessons Learned

This was a textbook “benchmark trap”: same hardware, same runtime, yet prompt cache management alone created a significant throughput difference. The initial assumption that “DeepSeek is slower than Kimi” turned out to be mostly an operational configuration issue.

The TG 14-15 tok/s plateau itself is explained by f16 KV cache settings and memory bandwidth. Applying the same q8_0 settings used for Kimi-K2.5 would likely have produced different results.

Reproduction Steps

1. Run DeepSeek-V3.2

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /path/to/deepseek-v3.2:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/DeepSeek-V3.2-Speciale.gguf \
  --cache-type-k f16 --cache-type-v f16 --flash-attn on \
  --ctx-size 16384 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
  

2. Improved Version (KV Cache Quantization + Fixed Prompt)

  # Change KV cache to q8_0
  --cache-type-k q8_0 --cache-type-v q8_0

  # Enable prompt cache
  --prompt-cache /tmp/deepseek-cache.bin
  

Additionally, lock down System Prompt and <think> tag presence across all requests.

3. Measurement

Extract S_PP and S_TG from llama.cpp server logs. Compare before and after optimization.
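A sketch for the extraction, assuming the standard llama.cpp timing lines of the form “prompt eval time = … ms / … tokens ( … ms per token, 99.75 tokens per second)”; field positions vary between builds, so adjust the patterns if yours differ:

```shell
# extract_speeds LOGFILE: print one "PP TG" tok/s pair per request.
# Relies on llama.cpp timing lines ending in "... tokens per second)";
# in that layout $(NF-3) is the tok/s figure.
extract_speeds() {
  awk '/prompt eval time/          { pp = $(NF-3) }
       /^[[:space:]]*eval time/    { print pp, $(NF-3) }' "$1"
}
```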

Technical Notes

Principles for Effective Prompt Caching

  1. Fix the leading token sequence: System Prompt → Fixed Context → Variable Parts, in strict order
  2. Keep Thinking mode consistent: If enabled, enable for all requests. Toggling per-request invalidates cache every time
  3. Align generation parameters: Temperature, top_p differences can also affect cache hit rates
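Principles 1 and 3 can be enforced mechanically by building every request body from the same template. A minimal sketch; the endpoint, system prompt text, and sampling values are illustrative, not the original setup:

```shell
# Locked prefix: the system message and sampling parameters never change,
# so every request shares the same leading token sequence server-side.
SYSTEM_PROMPT='You are the batch-processing assistant.'   # fixed (principle 1)
build_request() {   # build_request USER_TEXT -> JSON body on stdout
  # temperature/top_p are pinned in the template (principle 3)
  printf '{"messages":[{"role":"system","content":"%s"},{"role":"user","content":"%s"}],"temperature":0.2,"top_p":0.9}' \
    "$SYSTEM_PROMPT" "$1"
}
# Usage (endpoint illustrative):
#   build_request "summarize task 1" | \
#     curl -s http://localhost:8081/v1/chat/completions \
#          -H 'Content-Type: application/json' -d @-
```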

Fair Comparison with Kimi-K2.5

  • Match output token count, temperature, top_p, stop sequences, and stream settings exactly
  • Unify Thinking Token handling (logs show “Exclude reasoning tokens for slot selection”, but the reasoning tokens are still generated)
  • Use identical hardware, thread count, and KV cache settings

Improvement Priority

| Priority | Action | Expected Impact | Implementation Cost |
|----------|--------|-----------------|---------------------|
| A | Fix prompt prefix consistency | Major PP reduction | Low (config change) |
| B | Enable Speculative Decoding | TG perceived speed gain | Medium (draft model selection) |
| C | Quantize KV cache to q8_0 | TG bandwidth relief | Low (flag change) |
| D | Standardize generation conditions | Fair comparison | Low (test design) |