Background

After establishing stable 10-13 tok/s decode speeds with Kimi-K2.5 (1T MoE) on EPYC 9175F for batch processing, the next step was evaluating DeepSeek-V3.2 as a second-opinion model lane.

Run in the same llama.cpp environment, it showed fast prefill (PP) at 50-100 tok/s, but decode (TG) stuck at 14-15 tok/s and would not go higher. DeepSeek-V3.2 appeared “slower” than Kimi-K2.5, but the root cause needed isolation: was it model capability or operational configuration?

Objective

  1. Identify why DeepSeek-V3.2 decode speed stalls at 14-15 tok/s
  2. Quantify the impact of cache control and optimization flags from llama.cpp logs
  3. Prioritize improvement actions

Test Environment

  • CPU: AMD EPYC 9175F (Zen 5, 16C)
  • Memory: DDR5-6400 768GB (12ch)
  • OS: Ubuntu 24.04 LTS
  • Runtime: llama.cpp (server mode, fused_moe=1)
  • Model: DeepSeek-V3.2 Speciale (MoE)
  • KV Cache: f16 (default)

Results

Measured Inference Throughput

Values extracted from logs of 4 consecutive tasks:

| Task ID | PP (tok) | TG (tok) | Cumulative Tokens | PP Speed (tok/s) | TG Speed (tok/s) | PP Time (s) | TG Time (s) |
|---------|----------|----------|-------------------|------------------|------------------|-------------|-------------|
| 0       | 2,731    | 1,024    | 3,755             | 99.75            | 14.57            | 27.4        | 70.3        |
| 1026    | 857      | 1,024    | 5,636             | 74.47            | 15.21            | 11.5        | 67.3        |
| 2051    | 311      | 982      | 6,929             | 52.92            | 14.51            | 5.9         | 67.7        |
| 3034    | 4,865    | 1,024    | 12,818            | 100.22           | 14.22            | 48.5        | 72.0        |

PP varies between 50-100 tok/s depending on input size. TG concentrates in a tight 14.22-15.21 tok/s band, independent of input size or cumulative token count—a clear “wall.”

Cache Mismatch in Logs

  Common part does not match fully → kv cache rm [p0, end)
  

This appeared on every task. The leading token sequence differed between requests, causing KV cache to be discarded each time.
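One quick way to quantify this is to count the invalidation messages in the server log. A minimal sketch, assuming the log was captured to a file and that your llama.cpp build emits the message quoted above (adjust the pattern if it words it differently):

```shell
# count_cache_discards LOGFILE: number of KV cache invalidation events.
# The grep pattern matches the "kv cache rm" message quoted above.
count_cache_discards() {
  grep -c "kv cache rm" "$1"
}
```

If the count equals the request count, as it did here, no request is benefiting from the prompt cache.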

Speculative Decoding Status

  no implementations specified for speculative decoding
  

No draft model configured; Speculative Decoding was inactive.
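Enabling it means pointing the server at a small draft model that shares the target model's vocabulary. As a hedged sketch, the flag names below exist in recent llama.cpp builds, but the draft model path is purely illustrative; verify against llama-server --help for your build:

```shell
# Illustrative additions to the server command line (flag names from recent
# llama.cpp builds; the draft model path is hypothetical):
--model-draft /models/draft.gguf \
--draft-max 16 --draft-min 1
```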

Analysis

What the 14-15 tok/s Wall Means

The near-constant decode speed across tasks points to a physical memory bandwidth limit. KV cache is stored at f16 precision, and as context grows, attention computation becomes memory-bandwidth-dominated.
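A back-of-envelope check supports this. Assuming roughly 37B active parameters per token for the DeepSeek-V3 family and about 1 byte per weight, both assumptions since the GGUF quantization is not stated in the logs, 12-channel DDR5-6400 puts a hard ceiling close to the observed band:

```shell
# decode_ceiling: bandwidth-bound decode limit in tok/s.
# Assumptions (not from the logs): ~37e9 active params per token,
# ~1 byte per weight (8-bit quant), and theoretical peak bandwidth
# of 12 channels x 6400 MT/s x 8 bytes per transfer (~614 GB/s).
decode_ceiling() {
  awk 'BEGIN {
    bw = 12 * 6400 * 8 * 1e6        # bytes/s, ~614 GB/s theoretical
    bytes_per_token = 37e9 * 1.0    # active weights read per decoded token
    printf "%.1f\n", bw / bytes_per_token
  }'
}
decode_ceiling
```

That theoretical ceiling of roughly 16.6 tok/s, minus KV cache traffic and the gap between peak and sustained bandwidth, lands right on the 14-15 tok/s wall.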

Kimi-K2.5 was run with q8_0 KV cache quantization to reduce bandwidth pressure. Applying the same setting to DeepSeek-V3.2 should improve TG speed.

Impact of Prompt Cache Mismatches

With Kimi-K2.5, a fixed prefix (System Prompt + Knowledge Digest) maintained high LCP cache hit rates. The DeepSeek-V3.2 test had three issues:

  1. <think> tag inconsistency: Thinking Prompt presence varied per request, breaking leading token alignment
  2. System Prompt variance: Templates were not locked down
  3. Context management difference: Kimi-K2.5 reused prior context; DeepSeek rebuilt context from scratch each time

The conclusion: “DeepSeek is slower” was largely “DeepSeek was tested without cache benefits.”

MoE Optimization Gap

While fused_moe=1 is active in logs, llama.cpp’s MoE implementation is more generic compared to specialized kernels in vLLM or cloud services. Expert routing implementation differences likely contribute to the speed gap.

Lessons Learned

This was a textbook “benchmark trap”: same hardware, same runtime, yet prompt cache management alone created a significant throughput difference. The initial assumption that “DeepSeek is slower than Kimi” turned out to be mostly an operational configuration issue.

The TG 14-15 tok/s plateau itself is explained by f16 KV cache settings and memory bandwidth. Applying the same q8_0 settings used for Kimi-K2.5 would likely have produced different results.

Reproduction Steps

1. Run DeepSeek-V3.2

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /path/to/deepseek-v3.2:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/DeepSeek-V3.2-Speciale.gguf \
  --cache-type-k f16 --cache-type-v f16 --flash-attn on \
  --ctx-size 16384 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
  

2. Improved Version (KV Cache Quantization + Fixed Prompt)

  # Change KV cache to q8_0
  --cache-type-k q8_0 --cache-type-v q8_0

  # Enable prompt cache
  --prompt-cache /tmp/deepseek-cache.bin
  

Additionally, lock down System Prompt and <think> tag presence across all requests.

3. Measurement

Extract S_PP and S_TG from llama.cpp server logs. Compare before and after optimization.
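A sketch for the extraction, assuming the standard llama.cpp timing lines of the form “prompt eval time = … ms / … tokens ( … ms per token, 99.75 tokens per second)”; field positions vary between builds, so adjust the patterns if yours differ:

```shell
# extract_speeds LOGFILE: print one "PP TG" tok/s pair per request.
# Relies on llama.cpp timing lines ending in "... tokens per second)";
# in that layout $(NF-3) is the tok/s figure.
extract_speeds() {
  awk '/prompt eval time/          { pp = $(NF-3) }
       /^[[:space:]]*eval time/    { print pp, $(NF-3) }' "$1"
}
```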

Technical Notes

Principles for Effective Prompt Caching

  1. Fix the leading token sequence: System Prompt → Fixed Context → Variable Parts, in strict order
  2. Keep Thinking mode consistent: If enabled, enable for all requests. Toggling per-request invalidates cache every time
  3. Align generation parameters: Temperature, top_p differences can also affect cache hit rates
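Principles 1 and 3 can be enforced mechanically by building every request body from the same template. A minimal sketch; the endpoint, system prompt text, and sampling values are illustrative, not the original setup:

```shell
# Locked prefix: the system message and sampling parameters never change,
# so every request shares the same leading token sequence server-side.
SYSTEM_PROMPT='You are the batch-processing assistant.'   # fixed (principle 1)
build_request() {   # build_request USER_TEXT -> JSON body on stdout
  # temperature/top_p are pinned in the template (principle 3)
  printf '{"messages":[{"role":"system","content":"%s"},{"role":"user","content":"%s"}],"temperature":0.2,"top_p":0.9}' \
    "$SYSTEM_PROMPT" "$1"
}
# Usage (endpoint illustrative):
#   build_request "summarize task 1" | \
#     curl -s http://localhost:8081/v1/chat/completions \
#          -H 'Content-Type: application/json' -d @-
```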

Fair Comparison with Kimi-K2.5

  • Match output token count, temperature, top_p, stop sequences, and stream settings exactly
  • Unify Thinking Token handling (logs show “Exclude reasoning tokens for slot selection”, but the reasoning tokens are still generated)
  • Use identical hardware, thread count, and KV cache settings

Improvement Priority

| Priority | Action | Expected Impact | Implementation Cost |
|----------|--------|-----------------|---------------------|
| A | Fix prompt prefix consistency | Major PP reduction | Low (config change) |
| B | Enable Speculative Decoding | TG perceived speed gain | Medium (draft model selection) |
| C | Quantize KV cache to q8_0 | TG bandwidth relief | Low (flag change) |
| D | Standardize generation conditions | Fair comparison | Low (test design) |