The Reality of 40B Dense Models: What Running IQuest-Coder-V1-40B on CPU/GPU/Aider Actually Showed
IQuest-Coder-V1-40B-Instruct (Dense 40B) was tested across CPU Q5_K_M, GPU nvfp4, and Aider whole-edit mode. CPU inference proved structurally impractical, nvfp4 delivers a production-usable 25-28 tok/s, and Aider whole-edit is fundamentally incompatible with a 40B Dense model. Measured data on the operational limits of 40B Dense.
Background
IQuest-Coder-V1-40B-Instruct is a 40B-class Dense (non-MoE) coding-specialized model. Unlike MoE models, Dense models use all parameters during inference—computation scales linearly with model size.
With an RTX PRO 6000 Blackwell (96GB VRAM) available, both CPU inference (Q5_K_M) and GPU inference (nvfp4) were tested, plus Aider (AI coding agent) whole-edit mode viability. Bottom line: Dense models require GPU + proper quantization. CPU inference and whole-edit turned out to be structurally challenging.
Objective
- Validate whether CPU inference (Q5_K_M / llama.cpp) is practical for 40B Dense
- Measure GPU inference (nvfp4 / vLLM) throughput and VRAM usage
- Measure Aider whole-edit code editing speed
- Clarify the practical gap between MoE and Dense in production
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) |
| Memory | DDR5-6400 768GB (12ch) |
| OS | Ubuntu 24.04 LTS |
Test Configurations
| Config | Runtime | Quantization | Placement |
|---|---|---|---|
| A: CPU | llama.cpp server (Podman) | Q5_K_M (GGUF) | CPU RAM |
| B: GPU | vLLM | nvfp4 | GPU VRAM |
| C: Aider | vLLM (Loop-Instruct variant) | nvfp4 | GPU VRAM |
Results
Config A: CPU Inference (Q5_K_M, ctx=8K)
| Item | Measured |
|---|---|
| TTFT (4K-5K prompt) | Tens of seconds (UX failure) |
| CPU usage | All cores pinned at 100% |
| Root cause | 40B full-layer computation per token, no shortcuts |
CPU inference is structurally difficult for practical use. Prompt eval dominates, making pipeline use (low-TTFT requirement) impossible. SSD placement helps load time but is irrelevant to inference speed.
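A rough way to reproduce the TTFT problem against the Config A server (see Reproduction Steps below): with streaming enabled, curl's time_starttransfer approximates time-to-first-token. This is a hedged sketch, not the original measurement procedure; port 8081 matches the podman mapping below, PROMPT_FILE is a placeholder for a 4K-5K-token prompt, and jq is assumed to be installed.

```bash
# Streamed request: time_starttransfer ~ time until the first generated token arrives
curl -s -o /dev/null -w 'TTFT ~ %{time_starttransfer}s\n' \
  http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --rawfile p "$PROMPT_FILE" \
        '{messages:[{role:"user",content:$p}], max_tokens:32, stream:true}')"
```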
Config B: GPU nvfp4 (vLLM, ctx=8K-32K)
| Metric | Measured |
|---|---|
| PP speed | 1,100-2,300 tok/s |
| TG speed | 25-28 tok/s (stable) |
| KV cache usage | 2-12% |
| Prefix cache hit rate | 20-45% |
From continuous vLLM logs:
Engine 000: Avg generation throughput: 28.3 tokens/s, KV cache usage: 2.0%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 28.0 tokens/s, KV cache usage: 2.3%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.8 tokens/s, KV cache usage: 2.8%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.4 tokens/s, KV cache usage: 3.5%, Prefix cache hit rate: 22.8%
Latency by output length:
- 200 tokens: ~7-8 seconds
- 400 tokens: ~14-16 seconds
- 800 tokens: ~30 seconds
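These latencies track the 25-28 tok/s decode rate (e.g., 800 tokens / ~27 tok/s ≈ 30 s). One point can be spot-checked with the sketch below, assuming vLLM's default port 8000 and the model name from the serve command in Reproduction Steps.

```bash
# End-to-end latency at a fixed output length. ignore_eos is a vLLM-specific
# extra parameter that forces the full max_tokens; drop it if unsupported.
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4",
       "messages": [{"role": "user", "content": "Write a short sorting function."}],
       "max_tokens": 400, "ignore_eos": true}'
```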
Config C: Aider Whole-Edit
| Metric | Measured |
|---|---|
| TG speed | 0.6-8 tok/s (unstable) |
| KV cache usage | 7-13% (rapid growth) |
| Prefix cache hit rate | 6% (effectively disabled) |
Whole-edit regenerates entire files, causing token count explosion. repo-map + multi-file context inflates the prompt, and KV cache grows rapidly.
Comparison Summary
| Config | TG(tok/s) | Viability | Use Case |
|---|---|---|---|
| A: CPU Q5_K_M | Unmeasurable (TTFT failure) | No | - |
| B: GPU nvfp4 | 25-28 | Production | agent/test gen/CI |
| C: Aider whole-edit | 0.6-8 | No | - |
Analysis
Why Dense Fails on CPU
MoE models (e.g., Kimi-K2.5 with 32B active) compute only a subset of experts per token. A Dense model pushes every token through all 40B parameters; no computation reduction is possible. The same "40B-class" label therefore means fundamentally different CPU loads for MoE versus Dense.
Even EPYC 9175F’s 12-channel memory bandwidth saturates under Dense full-layer access patterns. The L3 cache expert-locality strategy that works for MoE is unusable here.
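For intuition, a crude decode roofline: every generated token requires reading the full weight set from RAM. The sketch below is a back-of-envelope estimate, assuming ~5.5 bits/weight for Q5_K_M and theoretical 12-channel DDR5-6400 bandwidth; real sustained bandwidth is lower, and this ignores the compute-bound prompt eval that actually breaks TTFT.

```bash
# Memory-bandwidth ceiling for Dense 40B decode on this host (theoretical best case)
awk 'BEGIN {
  bw_gb_s  = 6.4 * 8 * 12;          # DDR5-6400, 8 bytes/transfer, 12 channels ~ 614 GB/s
  model_gb = 40e9 * 5.5 / 8 / 1e9;  # 40B params at ~5.5 bits/weight ~ 27.5 GB
  printf "decode ceiling ~ %.0f tok/s\n", bw_gb_s / model_gb;
}'
```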
nvfp4 Efficiency
nvfp4 (4-bit quantization) keeps VRAM usage low while delivering stable 25-28 tok/s. Performance Factor comparison:
| Model | Params | TG speed | tok/s per B |
|---|---|---|---|
| command-a-reasoning | 111B | ~11 tok/s | 0.10 |
| IQuest-Coder-40B nvfp4 | 40B | 26-28 tok/s | 0.65-0.70 |
Per-B generation efficiency is ~6-7x higher than 111B-class. nvfp4 is working as intended.
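The per-B figures are simply measured TG divided by parameter count; using midpoints of the ranges in the table:

```bash
# tok/s per billion parameters (midpoints of the measured ranges above)
awk 'BEGIN {
  printf "command-a-reasoning : %.2f tok/s per B\n", 11 / 111;
  printf "IQuest-Coder-40B    : %.2f tok/s per B\n", 27 / 40;
}'
```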
Aider Whole-Edit Structural Problem
Why whole-edit is slow:
- Regenerating entire files produces massive output token counts
- repo-map + attached files inflate the prompt
- Prefix cache hit rate at 6% (constantly shifting context)
- KV cache grows rapidly (7→13%)
The fix is switching to diff/patch format, which drastically reduces output tokens and structurally bypasses the generation speed problem.
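A minimal sketch of that switch, assuming Aider talks to the local vLLM OpenAI-compatible endpoint on its default port 8000 (flag names per recent Aider releases; verify against the installed version):

```bash
# Force diff edits instead of whole-file regeneration; keep the repo-map small
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=local-key          # any non-empty value for a local server
aider --model openai/IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
      --edit-format diff \
      --map-tokens 1024
```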
Lessons Learned
The position of 40B Dense models is now clear. GPU nvfp4 makes it a strong contender for agent/aider/CI/test generation. 25-28 tok/s hits a sweet spot for “doesn’t overthink” Instruct-type work, diminishing the need for 111B reasoning models in daily workflows.
CPU inference is hard to call practical for Dense. Without MoE's structural advantage (expert selection reducing per-token computation), it seems better to choose MoE models for CPU operation.
Operational conclusion:
- Daily (agent/aider/CI): IQuest-Coder-40B nvfp4 (primary)
- Deep reasoning/design review: command-a-reasoning (secondary)
- Batch processing (CPU resident): MoE models (Kimi-K2.5 etc.)
Reproduction Steps
GPU nvfp4 (Recommended)
vllm serve IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
--max-num-seqs 1 \
--max-model-len 32768
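Once the server is up (port 8000 is vLLM's default unless --port is set), a quick smoke test and VRAM check:

```bash
# Confirm the model is being served, then check VRAM headroom
curl -s http://localhost:8000/v1/models
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```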
CPU Q5_K_M (Reference: Not Recommended)
podman run --rm -it \
-p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
-v "$MO":/models:ro,Z $IMG \
--host 0.0.0.0 --port 8080 -m "$MODEL" \
--jinja -c 8192 \
--threads 14 --threads-batch 14 \
-b 2048 -ub 512 \
--parallel 1 --flash-attn on
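Before any timing runs, confirm the container is reachable through the 8081-to-8080 port mapping; llama.cpp server exposes a simple health endpoint for this:

```bash
# Returns a small JSON status once the model has finished loading
curl -s http://localhost:8081/health
```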
Technical Notes
Dense vs MoE CPU Inference
| Item | Dense 40B | MoE 229B (10B active) |
|---|---|---|
| Per-token computation | All 40B layers | ~10B equivalent |
| L3 cache utilization | Ineffective (full-layer access) | Effective (expert locality) |
| CPU TG speed | Unmeasurable | 10-37 tok/s |
| CPU viability | Non-viable | Viable for batch |
Aider Optimization
- --edit-format diff: avoid whole-edit, reduce output tokens
- temperature=0: greedy decoding for speed
- Minimize repo-map and /add targets
- Set max_model_len to the minimum required

