The Reality of 40B Dense Models: What Running IQuest-Coder-V1-40B on CPU/GPU/Aider Actually Showed
IQuest-Coder-V1-40B-Instruct (Dense 40B) was tested across CPU Q5_K_M, GPU nvfp4, and Aider whole-edit mode. CPU inference proved structurally impractical, nvfp4 delivers a production-usable 25-28 tok/s, and Aider whole-edit is fundamentally incompatible with a 40B Dense model. Measured data on the operational limits of 40B Dense.
Background
IQuest-Coder-V1-40B-Instruct is a 40B-class Dense (non-MoE) coding-specialized model. Unlike MoE models, Dense models use all parameters during inference—computation scales linearly with model size.
With an RTX PRO 6000 Blackwell (96GB VRAM) available, both CPU inference (Q5_K_M) and GPU inference (nvfp4) were tested, plus Aider (AI coding agent) whole-edit mode viability. Bottom line: Dense models require GPU + proper quantization. CPU inference and whole-edit turned out to be structurally challenging.
Objective
- Validate whether CPU inference (Q5_K_M / llama.cpp) is practical for 40B Dense
- Measure GPU inference (nvfp4 / vLLM) throughput and VRAM usage
- Measure Aider whole-edit code editing speed
- Clarify the practical gap between MoE and Dense in production
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) |
| Memory | DDR5-6400 768GB (12ch) |
| OS | Ubuntu 24.04 LTS |
Test Configurations
| Config | Runtime | Quantization | Placement |
|---|---|---|---|
| A: CPU | llama.cpp server (Podman) | Q5_K_M (GGUF) | CPU RAM |
| B: GPU | vLLM | nvfp4 | GPU VRAM |
| C: Aider | vLLM (Loop-Instruct variant) | nvfp4 | GPU VRAM |
Results
Config A: CPU Inference (Q5_K_M, ctx=8K)
| Item | Measured |
|---|---|
| TTFT (4K-5K prompt) | Tens of seconds (UX failure) |
| CPU usage | All cores pinned at 100% |
| Root cause | 40B full-layer computation per token, no shortcuts |
CPU inference is structurally difficult for practical use. Prompt eval dominates, making pipeline use (low-TTFT requirement) impossible. SSD placement helps load time but is irrelevant to inference speed.
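A rough way to reproduce the TTFT problem against the Config A server (see Reproduction Steps below): with streaming enabled, curl's time_starttransfer approximates time-to-first-token. This is a hedged sketch, not the original measurement procedure; port 8081 matches the podman mapping below, PROMPT_FILE is a placeholder for a 4K-5K-token prompt, and jq is assumed to be installed.

```bash
# Streamed request: time_starttransfer ~ time until the first generated token arrives
curl -s -o /dev/null -w 'TTFT ~ %{time_starttransfer}s\n' \
  http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --rawfile p "$PROMPT_FILE" \
        '{messages:[{role:"user",content:$p}], max_tokens:32, stream:true}')"
```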
Config B: GPU nvfp4 (vLLM, ctx=8K-32K)
| Metric | Measured |
|---|---|
| PP speed | 1,100-2,300 tok/s |
| TG speed | 25-28 tok/s (stable) |
| KV cache usage | 2-12% |
| Prefix cache hit rate | 20-45% |
From continuous vLLM logs:
Engine 000: Avg generation throughput: 28.3 tokens/s, KV cache usage: 2.0%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 28.0 tokens/s, KV cache usage: 2.3%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.8 tokens/s, KV cache usage: 2.8%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.4 tokens/s, KV cache usage: 3.5%, Prefix cache hit rate: 22.8%
Latency by output length:
- 200 tokens: ~7-8 seconds
- 400 tokens: ~14-16 seconds
- 800 tokens: ~30 seconds
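These latencies track the 25-28 tok/s decode rate (e.g., 800 tokens / ~27 tok/s ≈ 30 s). One point can be spot-checked with the sketch below, assuming vLLM's default port 8000 and the model name from the serve command in Reproduction Steps.

```bash
# End-to-end latency at a fixed output length. ignore_eos is a vLLM-specific
# extra parameter that forces the full max_tokens; drop it if unsupported.
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4",
       "messages": [{"role": "user", "content": "Write a short sorting function."}],
       "max_tokens": 400, "ignore_eos": true}'
```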
Config C: Aider Whole-Edit
| Metric | Measured |
|---|---|
| TG speed | 0.6-8 tok/s (unstable) |
| KV cache usage | 7-13% (rapid growth) |
| Prefix cache hit rate | 6% (effectively disabled) |
Whole-edit regenerates entire files, causing token count explosion. repo-map + multi-file context inflates the prompt, and KV cache grows rapidly.
Comparison Summary
| Config | TG(tok/s) | Viability | Use Case |
|---|---|---|---|
| A: CPU Q5_K_M | Unmeasurable (TTFT failure) | No | - |
| B: GPU nvfp4 | 25-28 | Production | agent/test gen/CI |
| C: Aider whole-edit | 0.6-8 | No | - |
Analysis
Why Dense Fails on CPU
MoE models (e.g., Kimi-K2.5 with 32B active) compute only a subset of experts per token. A Dense model pushes every token through all 40B parameters; no computation reduction is possible. The same "40B-class" label therefore means fundamentally different CPU loads for MoE versus Dense.
Even EPYC 9175F’s 12-channel memory bandwidth saturates under Dense full-layer access patterns. The L3 cache expert-locality strategy that works for MoE is unusable here.
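For intuition, a crude decode roofline: every generated token requires reading the full weight set from RAM. The sketch below is a back-of-envelope estimate, assuming ~5.5 bits/weight for Q5_K_M and theoretical 12-channel DDR5-6400 bandwidth; real sustained bandwidth is lower, and this ignores the compute-bound prompt eval that actually breaks TTFT.

```bash
# Memory-bandwidth ceiling for Dense 40B decode on this host (theoretical best case)
awk 'BEGIN {
  bw_gb_s  = 6.4 * 8 * 12;          # DDR5-6400, 8 bytes/transfer, 12 channels ~ 614 GB/s
  model_gb = 40e9 * 5.5 / 8 / 1e9;  # 40B params at ~5.5 bits/weight ~ 27.5 GB
  printf "decode ceiling ~ %.0f tok/s\n", bw_gb_s / model_gb;
}'
```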
nvfp4 Efficiency
nvfp4 (4-bit quantization) keeps VRAM usage low while delivering stable 25-28 tok/s. Performance Factor comparison:
| Model | Params | TG speed | tok/s per B |
|---|---|---|---|
| command-a-reasoning | 111B | ~11 tok/s | 0.10 |
| IQuest-Coder-40B nvfp4 | 40B | 26-28 tok/s | 0.65-0.70 |
Per-B generation efficiency is ~6-7x higher than 111B-class. nvfp4 is working as intended.
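The per-B figures are simply measured TG divided by parameter count; using midpoints of the ranges in the table:

```bash
# tok/s per billion parameters (midpoints of the measured ranges above)
awk 'BEGIN {
  printf "command-a-reasoning : %.2f tok/s per B\n", 11 / 111;
  printf "IQuest-Coder-40B    : %.2f tok/s per B\n", 27 / 40;
}'
```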
Aider Whole-Edit Structural Problem
Why whole-edit is slow:
- Regenerating entire files produces massive output token counts
- repo-map + attached files inflate the prompt
- Prefix cache hit rate at 6% (constantly shifting context)
- KV cache grows rapidly (7→13%)
The fix is switching to diff/patch format, which drastically reduces output tokens and structurally bypasses the generation speed problem.
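A minimal sketch of that switch, assuming Aider talks to the local vLLM OpenAI-compatible endpoint on its default port 8000 (flag names per recent Aider releases; verify against the installed version):

```bash
# Force diff edits instead of whole-file regeneration; keep the repo-map small
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=local-key          # any non-empty value for a local server
aider --model openai/IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
      --edit-format diff \
      --map-tokens 1024
```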
Lessons Learned
The position of 40B Dense models is now clear. GPU nvfp4 makes it a strong contender for agent/aider/CI/test generation. 25-28 tok/s hits a sweet spot for “doesn’t overthink” Instruct-type work, diminishing the need for 111B reasoning models in daily workflows.
CPU inference is hard to call practical for Dense. Without MoE's structural advantage (expert selection reducing per-token computation), it seems better to choose MoE models for CPU operation.
Operational conclusion:
- Daily (agent/aider/CI): IQuest-Coder-40B nvfp4 (primary)
- Deep reasoning/design review: command-a-reasoning (secondary)
- Batch processing (CPU resident): MoE models (Kimi-K2.5 etc.)
Reproduction Steps
GPU nvfp4 (Recommended)
vllm serve IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
--max-num-seqs 1 \
--max-model-len 32768
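Once the server is up (port 8000 is vLLM's default unless --port is set), a quick smoke test and VRAM check:

```bash
# Confirm the model is being served, then check VRAM headroom
curl -s http://localhost:8000/v1/models
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```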
CPU Q5_K_M (Reference: Not Recommended)
podman run --rm -it \
-p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
-v "$MO":/models:ro,Z $IMG \
--host 0.0.0.0 --port 8080 -m "$MODEL" \
--jinja -c 8192 \
--threads 14 --threads-batch 14 \
-b 2048 -ub 512 \
--parallel 1 --flash-attn on
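Before any timing runs, confirm the container is reachable through the 8081-to-8080 port mapping; llama.cpp server exposes a simple health endpoint for this:

```bash
# Returns a small JSON status once the model has finished loading
curl -s http://localhost:8081/health
```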
Technical Notes
Dense vs MoE CPU Inference
| Item | Dense 40B | MoE 229B (10B active) |
|---|---|---|
| Per-token computation | All 40B layers | ~10B equivalent |
| L3 cache utilization | Ineffective (full-layer access) | Effective (expert locality) |
| CPU TG speed | Unmeasurable | 10-37 tok/s |
| CPU viability | Non-viable | Viable for batch |
Aider Optimization
- --edit-format diff: avoid whole-edit, reduce output tokens
- temperature=0: greedy decoding for speed
- Minimize repo-map and /add targets
- Set max_model_len to the minimum required

