Background

IQuest-Coder-V1-40B-Instruct is a 40B-class Dense (non-MoE) coding-specialized model. Unlike MoE models, Dense models use all parameters during inference—computation scales linearly with model size.

With an RTX PRO 6000 Blackwell (96GB VRAM) available, both CPU inference (Q5_K_M) and GPU inference (nvfp4) were tested, along with the viability of whole-edit mode in Aider (an AI pair-programming tool). Bottom line: Dense models require a GPU and proper quantization. Both CPU inference and whole-edit turned out to be structurally challenging.

Objective

  1. Validate whether CPU inference (Q5_K_M / llama.cpp) is practical for 40B Dense
  2. Measure GPU inference (nvfp4 / vLLM) throughput and VRAM usage
  3. Measure Aider whole-edit code editing speed
  4. Clarify the practical gap between MoE and Dense in production

Test Environment

| Item   | Specification |
|--------|---------------|
| CPU    | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU    | NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) |
| Memory | DDR5-6400 768GB (12ch) |
| OS     | Ubuntu 24.04 LTS |

Test Configurations

| Config   | Runtime | Quantization | Placement |
|----------|---------|--------------|-----------|
| A: CPU   | llama.cpp server (Podman) | Q5_K_M (GGUF) | CPU RAM |
| B: GPU   | vLLM | nvfp4 | GPU VRAM |
| C: Aider | vLLM (Loop-Instruct variant) | nvfp4 | GPU VRAM |

Results

Config A: CPU Inference (Q5_K_M, ctx=8K)

| Item | Measured |
|------|----------|
| TTFT (4K-5K prompt) | Tens of seconds (UX failure) |
| CPU usage | All cores pinned at 100% |
| Root cause | 40B full-layer computation per token, no shortcuts |

CPU inference is structurally impractical here: prompt evaluation dominates total latency, ruling out any pipeline use case that requires low TTFT. SSD placement of the model speeds loading but is irrelevant to inference speed.
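The memory-bandwidth arithmetic makes the ceiling concrete. A minimal back-of-envelope sketch, assuming ~5.7 bits/weight for Q5_K_M and theoretical peak bandwidth for 12-channel DDR5-6400 (both figures are assumptions, not measurements from this test):

```python
# Back-of-envelope ceiling for Dense CPU token generation.
# Assumptions (not measured here): Q5_K_M ~= 5.7 bits/weight,
# 12-channel DDR5-6400 peak = channels * 8 bytes * 6400 MT/s.

params = 40e9                  # 40B Dense: every weight is read per token
bits_per_weight = 5.7          # approximate Q5_K_M average
weight_bytes = params * bits_per_weight / 8    # ~28.5 GB read per token

peak_bw = 12 * 8 * 6400e6      # ~614 GB/s theoretical peak
tg_ceiling = peak_bw / weight_bytes

print(f"weights: {weight_bytes / 1e9:.1f} GB")
print(f"TG ceiling: {tg_ceiling:.1f} tok/s")
# Sustained bandwidth is typically well below peak, so real TG lands lower,
# and prompt eval (compute-bound) is what actually killed TTFT in Config A.
```

Even at theoretical peak, token generation tops out around ~21 tok/s; the compute-bound prompt-eval phase is the part with no such mitigation.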

Config B: GPU nvfp4 (vLLM, ctx=8K-32K)

| Metric | Measured |
|--------|----------|
| PP speed | 1,100-2,300 tok/s |
| TG speed | 25-28 tok/s (stable) |
| KV cache usage | 2-12% |
| Prefix cache hit rate | 20-45% |

From continuous vLLM logs:

  Engine 000: Avg generation throughput: 28.3 tokens/s, KV cache usage: 2.0%, Prefix cache hit rate: 22.8%
  Engine 000: Avg generation throughput: 28.0 tokens/s, KV cache usage: 2.3%, Prefix cache hit rate: 22.8%
  Engine 000: Avg generation throughput: 27.8 tokens/s, KV cache usage: 2.8%, Prefix cache hit rate: 22.8%
  Engine 000: Avg generation throughput: 27.4 tokens/s, KV cache usage: 3.5%, Prefix cache hit rate: 22.8%

Latency by output length:

  • 200 tokens: ~7-8 seconds
  • 400 tokens: ~14-16 seconds
  • 800 tokens: ~30 seconds
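These latencies are consistent with a simple model: total time ≈ TTFT + output_tokens / TG speed. A quick sanity check (the ~0.5 s TTFT is an assumption derived from the measured PP range against a 4K-5K prompt, not a reported number):

```python
# Sanity-check Config B latencies against a TTFT + tokens/TG model.
# Assumed TTFT ~0.5 s (4K-5K prompt at 1,100-2,300 tok/s PP); TG = 27 tok/s.

def latency(output_tokens, tg=27.0, ttft=0.5):
    """Estimated wall-clock seconds for one request."""
    return ttft + output_tokens / tg

for n in (200, 400, 800):
    print(f"{n} tokens: ~{latency(n):.1f} s")
```

The model lands at ~7.9 s, ~15.3 s, and ~30.1 s for 200/400/800 tokens, matching the observed ranges.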

Config C: Aider Whole-Edit

| Metric | Measured |
|--------|----------|
| TG speed | 0.6-8 tok/s (unstable) |
| KV cache usage | 7-13% (rapid growth) |
| Prefix cache hit rate | 6% (effectively disabled) |

Whole-edit regenerates entire files, causing token count explosion. repo-map + multi-file context inflates the prompt, and KV cache grows rapidly.

Comparison Summary

| Config | TG (tok/s) | Viability | Use Case |
|--------|------------|-----------|----------|
| A: CPU Q5_K_M | Unmeasurable (TTFT failure) | No | - |
| B: GPU nvfp4 | 25-28 | Production | agent / test gen / CI |
| C: Aider whole-edit | 0.6-8 | No | - |

Analysis

Why Dense Fails on CPU

MoE models (e.g., Kimi-K2.5 with 32B active) compute only a subset of experts per token. A Dense model computes all 40B parameters for every token, with no possible reduction. Two models can both be “40B-class” and still impose fundamentally different CPU loads depending on whether they are MoE or Dense.

Even EPYC 9175F’s 12-channel memory bandwidth saturates under Dense full-layer access patterns. The L3 cache expert-locality strategy that works for MoE is unusable here.

nvfp4 Efficiency

nvfp4 (NVIDIA's 4-bit floating-point format, hardware-accelerated on Blackwell) keeps VRAM usage low while delivering a stable 25-28 tok/s. Performance Factor comparison:

| Model | Params | TG speed | tok/s per B |
|-------|--------|----------|-------------|
| command-a-reasoning | 111B | ~11 tok/s | 0.10 |
| IQuest-Coder-40B nvfp4 | 40B | 26-28 tok/s | 0.65-0.70 |

Per-B generation efficiency is ~6-7x higher than 111B-class. nvfp4 is working as intended.
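The "tok/s per B" column is just TG speed divided by parameter count; recomputing it from the table's midpoints confirms the ratio:

```python
# Recompute the per-parameter throughput ("Performance Factor") column.
models = {
    "command-a-reasoning":    (111, 11.0),   # (params in B, TG tok/s)
    "IQuest-Coder-40B nvfp4": (40, 27.0),    # midpoint of 26-28
}

for name, (b, tg) in models.items():
    print(f"{name}: {tg / b:.2f} tok/s per B")

ratio = (27.0 / 40) / (11.0 / 111)
print(f"efficiency ratio: ~{ratio:.1f}x")   # ~6.8x, matching the ~6-7x claim
```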

Aider Whole-Edit Structural Problem

Why whole-edit is slow:

  1. Regenerating entire files produces massive output token counts
  2. repo-map + attached files inflate the prompt
  3. Prefix cache hit rate at 6% (constantly shifting context)
  4. KV cache grows rapidly (7→13%)

The fix is switching to diff/patch format, which drastically reduces output tokens and structurally bypasses the generation speed problem.
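The scale of the savings can be illustrated with rough numbers (the file size, edit size, and tokens-per-line figures below are hypothetical, chosen only to show the ratio, not measured in this test):

```python
# Hypothetical illustration: output tokens for whole-edit vs diff format.
# ~10 tokens/line is a rough heuristic, not a measured value.

file_lines = 500          # hypothetical source file
changed_lines = 20        # hypothetical small edit
tokens_per_line = 10
diff_overhead = 3         # hunk headers + context lines, rough multiplier

whole_edit_tokens = file_lines * tokens_per_line                 # 5000
diff_tokens = changed_lines * tokens_per_line * diff_overhead    # 600

print(f"whole-edit: {whole_edit_tokens} tokens, diff: {diff_tokens} tokens")
print(f"reduction: ~{whole_edit_tokens / diff_tokens:.0f}x")
```

At 0.6-8 tok/s, an order-of-magnitude cut in output tokens is the difference between minutes and seconds per edit.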

Lessons Learned

The position of 40B Dense models is now clear. GPU nvfp4 makes it a strong contender for agent/aider/CI/test generation. 25-28 tok/s hits a sweet spot for “doesn’t overthink” Instruct-type work, diminishing the need for 111B reasoning models in daily workflows.

CPU inference is difficult to consider practical for Dense. Without MoE’s structural advantage (expert selection reducing computation), it seems better to choose MoE models for CPU operation.

Operational conclusion:

  • Daily (agent/aider/CI): IQuest-Coder-40B nvfp4 (primary)
  • Deep reasoning/design review: command-a-reasoning (secondary)
  • Batch processing (CPU resident): MoE models (Kimi-K2.5 etc.)

Reproduction Steps

Config B (GPU, vLLM):

  vllm serve IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
    --max-num-seqs 1 \
    --max-model-len 32768

Config A (CPU, llama.cpp server via Podman):

  podman run --rm -it \
    -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
    -v "$MO":/models:ro,Z $IMG \
    --host 0.0.0.0 --port 8080 -m "$MODEL" \
    --jinja -c 8192 \
    --threads 14 --threads-batch 14 \
    -b 2048 -ub 512 \
    --parallel 1 --flash-attn on

Technical Notes

Dense vs MoE CPU Inference

| Item | Dense 40B | MoE 229B (10B active) |
|------|-----------|-----------------------|
| Per-token computation | All 40B parameters | ~10B equivalent |
| L3 cache utilization | Ineffective (full-layer access) | Effective (expert locality) |
| CPU TG speed | Unmeasurable | 10-37 tok/s |
| CPU viability | Non-viable | Viable for batch |

Aider Optimization

  • --edit-format diff: Avoid whole-edit, reduce output
  • temperature=0: Greedy decoding for speed
  • Minimize repo-map and /add targets
  • Set max_model_len to minimum required
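Putting those settings together, an invocation along these lines should work (flag names are from Aider's CLI as documented; the model name, port, and `--map-tokens` value are placeholders for this local-vLLM setup and should be adapted):

```shell
# Point Aider at the local vLLM OpenAI-compatible endpoint.
# Endpoint, key, and model name are placeholders for this setup.
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=dummy

aider \
  --model openai/IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
  --edit-format diff \
  --map-tokens 1024   # keep the repo-map small to protect prefix-cache hits
```

A small `--map-tokens` budget keeps the prompt prefix stable across turns, which is what the 6% hit rate in Config C was missing.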