Qwen3.5-397B IQ4_NL Measured: 22.5 tok/s Average from 28 Runs, Hybrid Offload Config and 400B-Class MoE Daily Viability
Qwen3.5-397B-A17B (397B total / 17B active MoE) deployed with IQ4_NL quantization on an EPYC 9175F + GPU hybrid setup. 28 consecutive inference runs averaging TG 22.5 tok/s, with a peak PP of 372 tok/s. Documenting the multi-GPU tensor offload and cpu-moe execution configurations.
Background
Qwen3.5-397B is a MoE model with 397B total parameters, of which 17B are active per token. It would normally require multiple H100s, but IQ4_NL quantization combined with cpu-moe and tensor offloading makes it runnable on EPYC + consumer-class GPU hardware.
The question is whether it merely “runs” or actually reaches “daily-use speed.” 28 consecutive inference runs provide the answer.
Objective
- Statistically characterize steady-state TG/PP speed from 28 runs with IQ4_NL
- Document hybrid offload execution configs (cpu-moe / multi-GPU tensor offload)
- Evaluate context-length dependency and stability
- Determine daily viability of 400B-class MoE
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory | DDR5-6400 768GB (12ch) |
| GPU | NVIDIA RTX PRO 6000 (96GB VRAM) |
| OS | Ubuntu 24.04 LTS |
| Runtime | ik_llama.cpp (cpu-moe enabled) |
| Quantization | IQ4_NL |
| Context | Up to 262,144 tokens |
Results
Throughput Statistics (28 Runs)
| Metric | Prefill (PP) | Generation (TG) |
|---|---|---|
| Maximum | 372.24 tok/s | 24.04 tok/s |
| Minimum | 101.49 tok/s | 19.13 tok/s |
| Mean (Steady State) | ~160 tok/s | ~22.5 tok/s |
Representative Runs
| # | PP(tok) | TG(tok) | PP(tok/s) | TG(tok/s) | Total(s) |
|---|---|---|---|---|---|
| 1 | 4,699 | 13 | 314.22 | 21.95 | 15.5 |
| 3 | 1,125 | 2,048 | 161.67 | 20.81 | 105.4 |
| 8 | 1,124 | 2,048 | 154.76 | 23.82 | 93.3 |
| 14 | 15,866 | 2,048 | 372.24 | 22.98 | 131.7 |
| 16 | 550 | 520 | 117.78 | 22.91 | 27.4 |
Run #14: a massive 15,866-token input was processed in 42.6 s, with generation then starting at 22.98 tok/s. That is 15K tokens of source code ingested in roughly 40 seconds.
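As a quick sanity check on that figure (simple arithmetic, not from the source logs):

```shell
# Run #14's 15,866 prompt tokens at the measured 372.24 tok/s prefill
# rate should take about 42.6 s, matching the reported wall time.
awk 'BEGIN { printf "%.1f s\n", 15866 / 372.24 }'
# prints "42.6 s"
```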
Context Length Dependency
- PP speed: Short prompts (<1k) stay at ~100 tok/s (overhead-dominated). Longer prompts increase parallel efficiency, accelerating to 300+ tok/s
- TG speed: Remarkably stable at 21-24 tok/s regardless of context length
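The PP scaling curve above can be reproduced with a prompt-length sweep. A sketch assuming the ik_llama.cpp build ships llama-bench like upstream llama.cpp (paths and lengths are illustrative):

```shell
# Sweep prompt lengths to observe prefill (pp) throughput scaling;
# -n 128 also reports generation (tg) speed at each point, -r 3 repeats.
./build/bin/llama-bench -m "$MODEL" \
  -p 512,2048,8192,16384 -n 128 -r 3
```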
Hybrid Offload Configurations
Single GPU + cpu-moe
```shell
IMG=compute.home.arpa/ik_llama-cuda
MO=/path/to/models   # host directory mounted into the container as /models
MODEL=/models/.../IQ4_NL/Qwen3.5-397B-A17B-IQ4_NL-00001-of-00006.gguf
podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z "$IMG" \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 262144 --threads 13 --threads-batch 24 \
  --jinja -b 2048 -ub 2048 -ngl 99 \
  -fa on --no-mmap --cpu-moe
```
--cpu-moe: keeps the MoE expert tensors, which do not fit in GPU VRAM, in system RAM and runs them on the CPU. The EPYC 9175F’s 12-channel DDR5 bandwidth feeds the active experts while the GPU accelerates the attention operations.
Multi-GPU Tensor Offload
With two GPUs, the -ot (tensor override) flag enables regex-based distribution of expert tensors across devices:
```shell
./build/bin/llama-server \
  --model "$model" \
  -fa on --ctx-size 135168 \
  -ctk q8_0 -ctv q8_0 \
  -ub 2048 -b 2048 -ngl 999 \
  -ot "blk\.(0|1|2|...|12)\.ffn_(gate|up|down)_exps.*=CUDA0,\
blk\.(47|48|...|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  --cpu-moe --threads 24 --no-mmap --jinja
```
Expert tensors for the early layers (0-12) go to CUDA0 and the late layers (47-60) to CUDA1; the middle layers fall through to cpu-moe. This flattens VRAM consumption across both cards while securing a 135K context.
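Writing those alternation groups out by hand is error-prone. A hypothetical helper (the `range` function and `OT` variable are illustrative, not from the source) can generate them with `seq`:

```shell
# Build the -ot override string for two layer ranges without typing
# every layer number; seq -s '|' joins the numbers with pipes.
range() { seq -s '|' "$1" "$2"; }
OT="blk\\.($(range 0 12))\\.ffn_(gate|up|down)_exps.*=CUDA0,\
blk\\.($(range 47 60))\\.ffn_(gate|up|down)_exps.*=CUDA1"
printf '%s\n' "$OT"
```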
Analysis
Why 397B Hits 22 tok/s
Despite the 397B total, only 17B parameters are active per token, and IQ4_NL compresses the memory-bandwidth load by 4x+ versus FP16. The 12-channel DDR5 bandwidth supplies the active experts while the GPU accelerates attention, achieving “17B-class speed” through this division of labor.
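A back-of-envelope check (my own numbers, not from the source: theoretical channel bandwidth, ~4.5 bits/weight for IQ4_NL) shows why this lands in the low 20s:

```shell
# Memory-bound ceiling for token generation, assuming all 17B active
# parameters are streamed from DDR5 on every token:
#   bandwidth: 12 channels * 6400 MT/s * 8 B ≈ 614 GB/s (theoretical)
#   per token: 17e9 weights * 4.5 bits / 8   ≈ 9.6 GB
awk 'BEGIN {
  bw    = 12 * 6400e6 * 8        # bytes/s
  bytes = 17e9 * 4.5 / 8         # bytes read per generated token
  printf "ceiling ~%.0f tok/s\n", bw / bytes
}'
# prints "ceiling ~64 tok/s"
```

The measured 22.5 tok/s is about 35% of that ceiling, a plausible real-world efficiency once attention compute and offload overhead are included.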
IQ4_NL Quantization Choice
Unquantized deployment of 397B is impractical. IQ4_NL minimizes quality degradation while drastically reducing memory footprint. No obvious quality degradation observed across 28 runs.
Warm-up Requirement
The first few runs show throughput jitter; it takes 2-3 dummy requests before TG speed stabilizes. For resident deployment, incorporate a warm-up script.
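A minimal warm-up sketch against llama-server's HTTP API (the `/completion` endpoint and JSON fields follow upstream llama.cpp; host and port match the podman example above, adjust to your deployment):

```shell
# Fire a few dummy completions so caches and thread pools settle
# before real traffic arrives.
for i in 1 2 3; do
  curl -s http://localhost:8001/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "warm-up", "n_predict": 16}' > /dev/null
done
```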
Lessons Learned
The notion that “400B-class is experimental only” is changing. 22.5 tok/s is sufficient for real-time human interaction, and it is solid throughput for async batch processing.
TG speed stability at 21-24 tok/s regardless of context length is particularly important. Feeding 15K tokens of source code doesn’t slow generation. This matters for use cases involving entire repository ingestion.
Multi-GPU tensor offload is a “use it if you have 2 GPUs” configuration. cpu-moe alone works, but explicit layer distribution via -ot improves throughput.
Technical Notes
KV Cache Quantization
At 135K-262K context windows, KV cache consumes massive memory. -ctk q8_0 -ctv q8_0 quantizes KV cache to suppress memory pressure. Perceived quality impact is minimal.
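To see why q8_0 matters at these window sizes, here is a rough size estimate. The layer count follows from the blk.0-60 range used above; the KV head count and head dimension are assumptions for illustration, not published figures:

```shell
# KV cache size = 2 (K and V) * layers * kv_dim * context * bytes/element.
# q8_0 stores 32 values in 34 bytes (32 int8 + one fp16 scale).
awk 'BEGIN {
  layers = 61; kv_dim = 8 * 128; ctx = 135168   # assumed dimensions
  f16  = 2 * layers * kv_dim * ctx * 2
  q8_0 = 2 * layers * kv_dim * ctx * (34 / 32)
  printf "f16 %.1f GiB, q8_0 %.1f GiB\n", f16 / 2^30, q8_0 / 2^30
}'
# prints "f16 31.5 GiB, q8_0 16.7 GiB" under these assumptions
```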
Sampling Parameters
For stable code generation: --temp 0.7 --repeat-penalty 1.2 --min-p 0.01 --top-p 0.95 --top-k 40. Temp 1.0 produces creative but verbose output.
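As a request sketch with those parameters (JSON field names per upstream llama.cpp's llama-server; the prompt is illustrative):

```shell
curl -s http://localhost:8001/completion \
  -H 'Content-Type: application/json' \
  -d '{
        "prompt": "Write a function that reverses a string.",
        "n_predict": 256,
        "temperature": 0.7,
        "repeat_penalty": 1.2,
        "min_p": 0.01,
        "top_p": 0.95,
        "top_k": 40
      }'
```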

