Background

For local LLM workflows involving chat, code generation, and MCP tool integration, choosing a quantization level is a recurring decision. NousResearch's Hermes-4.3-36B, a 36B-class model strong in tool use (Function Calling), was evaluated as a candidate for serving with vLLM.

With an RTX PRO 6000 Blackwell (96GB VRAM), BF16 (no quantization) runs but consumes 90%+ VRAM, leaving little room for context. nvfp4 (4-bit) needs only ~22GB. The question: what do you lose in exchange for speed?

Objective

  1. Quantify generation speed, TTFT, and VRAM consumption across BF16, FP8, and nvfp4
  2. Assess how much the “fast = smart” illusion occurs in practice
  3. Establish per-use-case quantization selection criteria

Test Environment

  Item      Specification
  GPU       NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
  CPU       AMD EPYC 9175F
  Memory    DDR5-6400 768GB
  Runtime   vLLM 0.14.0rc1
  Model     NousResearch/Hermes-4.3-36B

Methodology

BF16 (No Quantization)

  vllm serve NousResearch/Hermes-4.3-36B \
    --dtype bfloat16 \
    --max-num-seqs 1 \
    --max-model-len 65536

FP8

  vllm serve NousResearch/Hermes-4.3-36B \
    --dtype bfloat16 \
    --quantization fp8 \
    --max-num-seqs 1 \
    --max-model-len 65536

nvfp4 (4-bit)

  vllm serve NousResearch/Hermes-4.3-36B-nvfp4 \
    --max-num-seqs 1 \
    --max-model-len 32768

Same chat and code generation tasks were run on each configuration, with throughput measured from vLLM logs.

Results

Cross-Comparison

  Metric              BF16                FP8           nvfp4
  Generation (TG)     17-19 tok/s         18-20 tok/s   31-33 tok/s
  Prefill (PP)        300-500 tok/s       1000+ tok/s   280-500 tok/s
  TTFT                Slow                Improved      Good
  VRAM Usage          90%+                Medium        ~22GB
  KV Cache Usage      6-8% (short text)   Improved      1-2% (ample headroom)
  Quality/Stability   Most stable         Good          Some instability in edge cases

Key Observations

nvfp4 generation is ~1.7-2x faster than BF16. 31-33 tok/s eliminates the “waiting” feeling in conversation. At ~22GB VRAM, multi-model concurrent operation is feasible on a 96GB GPU.

FP8 Prefill is exceptionally fast. 1000+ tok/s Prefill significantly reduces TTFT when sending long System Prompts or RAG contexts. Generation speed is roughly equal to BF16.
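The TTFT gap follows directly from prefill throughput. A back-of-envelope sketch, where the 8,000-token prompt is a hypothetical RAG context size and the throughput figures come from the table above:

```python
# Back-of-envelope TTFT estimate: prompt_tokens / prefill throughput.
# Ignores fixed scheduling overhead, so treat the numbers as lower bounds.
def ttft_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Approximate time-to-first-token for a given prefill rate."""
    return prompt_tokens / prefill_tps

bf16_ttft = ttft_seconds(8000, 400)   # BF16-class prefill: 20.0 s
fp8_ttft = ttft_seconds(8000, 1000)   # FP8-class prefill:   8.0 s
print(f"BF16: {bf16_ttft:.1f}s  FP8: {fp8_ttft:.1f}s")
```

For long-context workloads, that difference is the wait before any token appears, which is why FP8's prefill advantage matters more than its generation speed.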

BF16 is heavy but most reliable. Uses 90%+ VRAM limiting context length, but showed the highest consistency in long-form text and precise code modifications.

Analysis

The “Fast = Smart” Illusion

nvfp4’s snappy responses create a subjective impression of “the model got smarter.” In reality, quantization-induced quality degradation surfaces in long-form coherence and complex reasoning. This illusion is easy to miss with subjective evaluation alone.

Evaluate with objective metrics like “first-pass test success rate” rather than perceived responsiveness.
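One way to make that metric concrete is to track first-pass success over a fixed task set. A minimal sketch, where the function and the run data are illustrative rather than measured results:

```python
# First-pass success rate: the fraction of tasks whose generated code
# passed its test suite on the first attempt, with no retries.
def first_pass_rate(results: list[bool]) -> float:
    """results[i] is True iff task i passed its tests on the first try."""
    return sum(results) / len(results) if results else 0.0

# Hypothetical runs of the same task set on two quantization levels:
nvfp4_runs = [True, True, False, True, False]
bf16_runs = [True, True, True, True, False]
print(first_pass_rate(nvfp4_runs), first_pass_rate(bf16_runs))  # 0.6 0.8
```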

Context-Switching Design Philosophy

The most rational approach was switching by use case:

Exploratory development (nvfp4 advantage):

  • “Just try it” phase
  • Rapid iteration on code fragments
  • MCP + context7 conversational workflows
  • Short latency maintains development rhythm

Destructive changes (BF16/FP8 advantage):

  • Repository-wide refactoring requiring consistency
  • Critical logic modifications
  • Phases where first-pass test success rate determines efficiency
  • Final review stages

Tool Use Capability

Hermes-4.3-36B showed limitations in deep reasoning but was relatively stable in tool use (MCP, Function Calling). Argument specification and task chaining worked reliably, making it practical in workflows that combine LLM with external tools (static analysis, etc.).
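As an illustration, a tool-use request to the vLLM server's OpenAI-compatible endpoint looks like the sketch below. The `run_static_analysis` tool and its schema are hypothetical examples, not part of Hermes or vLLM:

```python
import json

# OpenAI-compatible chat request with a tool definition, as accepted by
# vLLM's /v1/chat/completions endpoint. The tool is a made-up example.
payload = {
    "model": "NousResearch/Hermes-4.3-36B",
    "messages": [{"role": "user", "content": "Lint src/main.py"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "run_static_analysis",
            "description": "Run a linter on a file and return findings",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
}
# POST this JSON to http://localhost:8000/v1/chat/completions and check
# the response's tool_calls for the argument specification.
print(json.dumps(payload)[:60])
```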

Lessons Learned

Rather than “nvfp4 is fast but sloppy” vs “BF16 is slow but solid,” switching by development phase was the practical answer. Using nvfp4 as default and switching to BF16 only for final modifications produced the least friction.

FP8 fills the “BF16 is too heavy but 4-bit is scary” gap perfectly. Its Prefill speed (1000+ tok/s) is particularly valuable for RAG and MCP workflows with long contexts.

Reproduction Steps

1. Download Models

  # BF16/FP8
  huggingface-cli download NousResearch/Hermes-4.3-36B

  # nvfp4
  huggingface-cli download NousResearch/Hermes-4.3-36B-nvfp4

2. Launch vLLM Server

See commands in “Methodology” section. --max-num-seqs 1 is for single-user chat. Increase for batch processing.

3. Measure

Extract Avg generation throughput and Avg prompt throughput from vLLM logs. Confirm stable values across multiple requests.
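A small parser for those log lines might look like the sketch below; the exact log format varies between vLLM versions, so treat the regex as a template to adapt:

```python
import re

# Extract (prompt_tps, generation_tps) from a vLLM metrics log line.
# The format below matches recent vLLM releases but is not guaranteed stable.
LOG_RE = re.compile(
    r"Avg prompt throughput: ([\d.]+) tokens/s, "
    r"Avg generation throughput: ([\d.]+) tokens/s"
)

def parse_throughput(line: str):
    """Return (prompt_tps, generation_tps), or None if the line doesn't match."""
    m = LOG_RE.search(line)
    return (float(m.group(1)), float(m.group(2))) if m else None

sample = ("INFO ... Avg prompt throughput: 1023.4 tokens/s, "
          "Avg generation throughput: 31.5 tokens/s, Running: 1 reqs")
print(parse_throughput(sample))  # (1023.4, 31.5)
```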

Technical Notes

FP8 Quantization in vLLM

vLLM’s --quantization fp8 converts BF16 weights to FP8 at load time, so no pre-quantized checkpoint is needed. Hardware FP8 kernels require compute capability 8.9 or later (Ada and newer); the RTX PRO 6000 Blackwell used here reports compute capability 12.0.

nvfp4 VRAM Estimate

A 36B model in nvfp4 takes ~22GB. It is runnable on 24GB GPUs, but 32GB+ is recommended for KV cache headroom. On 96GB, 2-3 nvfp4 models can be loaded simultaneously.
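The ~22GB figure is consistent with simple arithmetic: 4 bits per weight plus roughly 0.5 bits of block-scale overhead (assuming one FP8 scale per 16-value block), before KV cache and runtime overhead are added:

```python
# Weight-memory estimate for an N-billion-parameter model in NVFP4.
# 4-bit weights + ~0.5 bits/weight of block scales = 4.5 bits per weight.
def nvfp4_weight_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Gigabytes of weight storage for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(f"{nvfp4_weight_gb(36):.1f} GB")  # ~20.2 GB for weights alone
```

The remaining gap to ~22GB is plausibly scales, embeddings kept at higher precision, and allocator overhead; the exact split is an assumption, not measured.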

Quantization Selection Flowchart

  1. VRAM insufficient -> nvfp4 (only option)
  2. VRAM available + chat focus -> nvfp4 (speed priority)
  3. VRAM available + code editing -> FP8 (balanced)
  4. Final review / precision required -> BF16 (quality priority)
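The four branches above can be expressed as a small selection helper; the function name and use-case labels are hypothetical:

```python
# Quantization selection flowchart as code. Branches mirror the list above.
def pick_quantization(vram_sufficient: bool, use_case: str) -> str:
    """use_case: one of 'chat', 'code_editing', 'final_review'."""
    if not vram_sufficient:
        return "nvfp4"   # 1. VRAM insufficient: only option
    if use_case == "chat":
        return "nvfp4"   # 2. speed priority
    if use_case == "code_editing":
        return "fp8"     # 3. balanced
    return "bf16"        # 4. final review / quality priority

print(pick_quantization(True, "chat"))          # nvfp4
print(pick_quantization(True, "final_review"))  # bf16
```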