MiniMax-2.5 229B MoE with IQ5K Quantization on Blackwell GPU: 35 tok/s Generation, 65K Context Validation
Benchmarking MiniMax-2.5 229B MoE (IQ5K) on NVIDIA RTX PRO 6000 Blackwell. Prompt evaluation variability (161-216 tok/s), stable generation (33.8-35.4 tok/s), KV cache behavior, and expert CPU placement impact analyzed in detail.
Background
MiniMax-2.5 229B MoE is a mixture-of-experts model featuring 256 experts, designed specifically for long-context processing and knowledge-intensive tasks. When evaluating local deployment options, balancing GPU memory constraints against generation speed is critical. This benchmark validates IQ5K quantization viability on Blackwell-class hardware.
The objective was to assess feasibility and performance on NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) under a 65,536-token context window setting.
Objective
- Confirm MiniMax-2.5 229B MoE IQ5K quantization executability on Blackwell hardware
- Separately measure prompt evaluation (prefill) vs. generation (decode) throughput
- Quantify the impact of expert CPU placement (-ot exps=CPU)
- Validate 65,536-token KV cache behavior and prompt cache stability
Experimental Environment
| Item | Specification |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96GB VRAM, Compute Capability 12.0) |
| CPU | Intel/AMD EPYC with AVX/AVX2/AVX512 support |
| Memory | 768GB DDR5 |
| Model | MiniMax-2.5 (minimax-m2 architecture), 229B.A10B (MoE) |
| Quantization | IQ5_K (5.5 bits per weight nominal, model size 157.77 GiB) |
| Context Length | 65,536 tokens |
| Runtime | llama.cpp (commit: 1cb7e1bf, build: 4192) |
Implementation
Launch Command
podman run --rm -it \
--device nvidia.com/gpu=all \
-p 8081:8080 \
--shm-size 16g \
--cap-add=SYS_NICE \
-v "$MO":/models:ro,Z \
$IMG \
--host 0.0.0.0 --port 8080 \
-m "$MODEL" \
--no-mmap --jinja \
-c 65536 \
--threads 13 --threads-batch 25 \
-b 2048 -ub 2048 \
-ngl 99 \
-ot exps=CPU \
-ctk f16 -ctv f16 \
--warmup-batch \
-fa on
Parameter Explanation
- -c 65536: Set context length to 65,536 tokens
- --threads 13: CPU processing threads
- -ngl 99: Offload 99 layers to GPU (near-complete layer offloading)
- -ot exps=CPU: Force MoE expert weights to CPU placement
- -fa on: Enable Flash Attention
- --no-mmap: Load weights directly into memory without mmap
Benchmark Execution
Eight request cycles with varying prompt and generation lengths were measured via the HTTP /chat/completions endpoint; response times were recorded end-to-end.
Results
Memory Layout (Measured)
| Memory Region | Size |
|---|---|
| CPU buffer | 157,356 MiB (IQ5K weights) |
| CUDA0 buffer | 3,578.73 MiB (GPU-resident non-expert weights) |
| KV cache (CUDA0) | 15,872 MiB (K: 7.75 GiB + V: 7.75 GiB) |
| Compute buffer (CUDA0) | 1,990 MiB |
The KV cache is allocated in f16 and remains stable under the full 65,536-token load. The GPU-side compute buffer stays around 2 GiB, posing no operational impediment.
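As a sanity check, the measured allocation implies a fixed per-token KV footprint, which can be derived directly from the two numbers above:

```shell
# Per-token KV footprint implied by the measured allocation:
# 15,872 MiB of f16 K+V across a 65,536-token window.
kv_per_token_kib=$(awk 'BEGIN { printf "%.0f", 15872 * 1024 / 65536 }')
echo "$kv_per_token_kib KiB per token"
```

This works out to 248 KiB per token, split evenly between K and V.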
Benchmark Results (8 Runs)
| Run | Prompt tok | PP ms | PP tok/s | Gen tok | Gen ms | Gen tok/s | Total tok | Total ms | Total tok/s |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 753 | 3,517 | 214.07 | 215 | 6,079 | 35.37 | 968 | 9,597 | 100.87 |
| 2 | 386 | 2,266 | 170.36 | 196 | 5,563 | 35.23 | 582 | 7,829 | 74.34 |
| 3 | 297 | 1,840 | 161.38 | 240 | 6,816 | 35.21 | 537 | 8,656 | 62.04 |
| 4 | 341 | 2,053 | 166.12 | 783 | 22,651 | 34.57 | 1,124 | 24,703 | 45.50 |
| 5 | 1,264 | 6,152 | 205.46 | 734 | 21,259 | 34.53 | 1,998 | 27,411 | 72.89 |
| 6 | 942 | 4,377 | 215.21 | 921 | 26,849 | 34.30 | 1,863 | 31,226 | 59.66 |
| 7 | 938 | 4,338 | 216.23 | 157 | 4,576 | 34.31 | 1,095 | 8,914 | 122.84 |
| 8 | 1,075 | 6,097 | 176.32 | 1,351 | 40,019 | 33.76 | 2,426 | 46,116 | 52.61 |
Statistical Summary
| Metric | Prompt tok/s | Gen tok/s | Total tok/s |
|---|---|---|---|
| Mean | 190.64 | 34.66 | 73.84 |
| Median | 190.89 | 34.55 | 67.46 |
| Min | 161.38 | 33.76 | 45.50 |
| Max | 216.23 | 35.37 | 122.84 |
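The summary statistics can be recomputed directly from the per-run table; for example, the prompt-throughput mean:

```shell
# Recompute the prompt tok/s mean from the 8 per-run values in the table.
pp_mean=$(printf '%s\n' 214.07 170.36 161.38 166.12 205.46 215.21 216.23 176.32 \
  | awk '{ s += $1; n++ } END { printf "%.2f", s / n }')
echo "mean prompt throughput: $pp_mean tok/s"
```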
Prompt Evaluation Variability
Prefill throughput varies substantially across runs (161-216 tok/s) due to:
- Cache hit patterns: Reused prompts benefit from cached entries
- KV maintenance overhead: As prompt cache approaches 15.8 GiB, eviction and consistency checks incur costs
- NUMA/paging effects: 157 GiB CPU-side memory access patterns not uniform across requests
Generation Stability
Decode throughput holds steady at 33.76-35.37 tok/s across all runs, indicating that PCIe/host-memory bandwidth becomes the dominant constraint once expert computation shifts to the CPU.
Discussion
Expert CPU Placement Impact
Despite logs showing "offloaded 63/63 layers to GPU", the -ot exps=CPU flag keeps all 256 experts' MLP weights in host RAM. This results in:
- Per-step host traffic: each decode step activates only 8 of 256 experts, whose matmuls run against CPU-resident weights, with activations crossing PCIe between GPU-side attention and CPU-side expert FFNs
- Bandwidth saturation: reading the active experts' weights from system RAM dominates the per-token cost
- Generation ceiling: the ~35 tok/s plateau is consistent with a host-memory/PCIe bandwidth bound rather than a compute bound
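A rough plausibility check for the ~35 tok/s plateau: divide the expert bytes read per token into an assumed effective host-memory bandwidth. The ~10B active parameters comes from the 229B.A10B designation and 5.5 bpw from the quantization table; the 240 GB/s bandwidth figure is an illustrative assumption, not a measurement:

```shell
# Back-of-envelope decode ceiling: bytes of expert weight read per token,
# divided into an assumed effective CPU memory bandwidth. The bandwidth
# figure is a rough assumption for illustration, not a measured value.
gb_per_token=$(awk 'BEGIN { printf "%.3f", 10e9 * 5.5 / 8 / 1e9 }')          # ~10B active params @ 5.5 bpw
toks_ceiling=$(awk 'BEGIN { printf "%.0f", 240 / (10e9 * 5.5 / 8 / 1e9) }')  # assumed 240 GB/s DDR5
echo "~$gb_per_token GB read/token -> ~$toks_ceiling tok/s ceiling"
```

Under these assumptions the ceiling lands at roughly 35 tok/s, in line with the observed plateau.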
Prompt Cache Operation
Case B logs confirm functional prompt cache:
- 6,029 tokens → 1,460 MiB state saved
- 23,104 tokens → 5,595 MiB state saved
- 8,192 MiB limit respected
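As a consistency check, the two saved states logged above imply the same per-token state cost, consistent with linear KV state growth:

```shell
# Per-token prompt-cache state size for the two logged saves.
a=$(awk 'BEGIN { printf "%.3f", 1460 / 6029 }')
b=$(awk 'BEGIN { printf "%.3f", 5595 / 23104 }')
echo "$a MiB/token vs $b MiB/token"
```

Both come out to ~0.242 MiB/token, matching the f16 KV allocation rate.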
Reusable system prompts benefit from cache hits, mitigating first-request prefill slowdown.
Flash Attention Effectiveness
With flash attention enabled (-fa on), attention over the 15.8 GiB KV cache retains efficient memory access patterns even at full context.
Conclusion
End-to-end stability, from startup through HTTP request acceptance, KV allocation, and flash-attention activation, demonstrates that 65,536-token serving is technically achievable.
Performance tuning, however, is limited while -ot exps=CPU and --no-mmap remain unchanged: minor parameter tweaks (threads, batch size) yield only marginal gains. Address the expert placement and mmap strategy first for meaningful improvement.
The tokenizer warning (special_eos_id not in special_eog_ids) warrants investigation, as it may degrade stop condition and tag interpretation reliability.
Reproduction
1. Model Retrieval
huggingface-cli download TheBloke/MiniMax-2.5-A10B-IQ5_K-GGUF
2. Start llama.cpp Server
Follow the "Implementation" section, replacing $MODEL with the GGUF path and $IMG with the llama.cpp container image.
3. Benchmark Measurement
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "minimax",
"messages": [{"role": "user", "content": "Your prompt..."}],
"max_tokens": 1024
}' | jq .usage
Compute tok/s from the token counts and durations in the response (e.g. prompt_eval_count / prompt_eval_duration and completion_tokens / completion_eval_duration; exact field names vary by server version). Average multiple requests for stability.
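The throughput arithmetic itself is straightforward; the sketch below uses Run 1's raw numbers from the results table (field names in an actual response depend on the server build, so check the JSON you get back):

```shell
# Throughput from one response's timing values (sample numbers from Run 1).
prompt_n=753;    prompt_ms=3517
predicted_n=215; predicted_ms=6079
pp=$(awk -v n=$prompt_n -v ms=$prompt_ms 'BEGIN { printf "%.2f", n / ms * 1000 }')
gen=$(awk -v n=$predicted_n -v ms=$predicted_ms 'BEGIN { printf "%.2f", n / ms * 1000 }')
echo "pp $pp tok/s, gen $gen tok/s"
```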
Technical Notes
Expert Offload Strategies
-ot exps=CPU optimizes GPU memory (smaller footprint) but sacrifices throughput. Alternatives:
- -ot exps=GPU: Place all experts on GPU (higher VRAM, faster)
- Mixed offload: Frequent experts GPU, rare experts CPU (complex but balanced)
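A mixed placement along these lines can be expressed with additional --override-tensor (-ot) patterns. The sketch below is illustrative only: the tensor-name regexes and the assumption that earlier patterns take precedence must be verified against the actual tensor names in your GGUF before use:

```shell
# Illustrative mixed offload: keep expert tensors of blocks 0-9 on GPU,
# send the rest to CPU. Regexes and pattern ordering are assumptions;
# verify against your GGUF's actual tensor names.
llama-server -m "$MODEL" -ngl 99 -fa on \
  -ot 'blk\.[0-9]\.ffn_.*_exps=CUDA0' \
  -ot 'exps=CPU'
```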
No-mmap Implications
--no-mmap loads all 157 GiB of weights upfront, lengthening startup and raising the likelihood of paging on NUMA systems. Enabling mmap shortens initialization but may increase page faults at runtime depending on access patterns.
Tokenizer Configuration
The special_eos_id is not in special_eog_ids warning indicates a mismatch between the model's tokenizer metadata and llama.cpp's interpretation of it. Verify the GGUF's special-token metadata against the model provider's tokenizer configuration; regenerating the GGUF with corrected metadata may resolve it.
Context Length and Batch Efficiency
Larger context windows expand KV cache memory footprint, reducing effective batch efficiency. The -b 2048 -ub 2048 config proved stable here but requires tuning for different hardware/memory configurations.
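Given the measured 248 KiB/token f16 footprint, the KV cost at other window sizes can be projected directly (f16 K/V assumed throughout; quantized KV formats would shrink these figures):

```shell
# Project f16 KV cache size at other context lengths from the measured
# 15,872 MiB / 65,536-token allocation (248 KiB per token).
for ctx in 32768 65536 131072; do
  awk -v c=$ctx 'BEGIN { printf "ctx %6d -> %5.0f MiB KV\n", c, c * 248 / 1024 }'
done
```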
Related Topics
- MiniMax-2.5 Expert Offload and Web Generation — Quantization comparison across IQ4_NL/IQ3_S, one-shot generation of React LP and dental clinic site. Includes a video demonstrating actual generation output
- NVIDIA Blackwell compute properties (Capability 12.0)
- llama.cpp MoE support maturity
- Flash Attention KV cache optimization
- PCIe Gen 5 bandwidth and expert computation scaling