Why Quantization Choice Changes Everything for Hermes-4.3-36B: A Measured BF16/FP8/nvfp4 Comparison
Comparing Hermes-4.3-36B across BF16, FP8, and nvfp4 on a Blackwell GPU. nvfp4 generates up to ~2x faster than BF16, but the speed-quality trade-off demands context-aware switching.
Background
For local LLM workflows involving chat, code generation, and MCP tool integration, choosing a quantization level is a recurring decision. NousResearch's Hermes-4.3-36B, a 36B-class model strong in tool use (Function Calling), was evaluated here as a candidate for serving with vLLM.
On an RTX PRO 6000 Blackwell (96GB VRAM), BF16 (no quantization) runs but consumes over 90% of VRAM, leaving little room for context. nvfp4 (4-bit) needs only ~22GB. The question: what do you lose in exchange for speed?
Objective
- Quantify generation speed, TTFT, and VRAM consumption across BF16, FP8, and nvfp4
- Assess the extent to which the “fast = smart” illusion occurs in practice
- Establish per-use-case quantization selection criteria
Test Environment
| Item | Specification |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB |
| CPU | AMD EPYC 9175F |
| Memory | DDR5-6400 768GB |
| Runtime | vLLM 0.14.0rc1 |
| Model | NousResearch / Hermes-4.3-36B |
Methodology
BF16 (No Quantization)
# Full-precision weights; a 64K context just fits in 96GB
vllm serve NousResearch/Hermes-4.3-36B \
  --dtype bfloat16 \
  --max-num-seqs 1 \
  --max-model-len 65536
FP8
# Runtime FP8 quantization of the BF16 checkpoint (no pre-quantized model needed)
vllm serve NousResearch/Hermes-4.3-36B \
  --dtype bfloat16 \
  --quantization fp8 \
  --max-num-seqs 1 \
  --max-model-len 65536
nvfp4 (4-bit)
# Pre-quantized nvfp4 checkpoint; context capped at 32K in this test
vllm serve NousResearch/Hermes-4.3-36B-nvfp4 \
  --max-num-seqs 1 \
  --max-model-len 32768
The same chat and code generation tasks were run on each configuration, with throughput measured from vLLM logs.
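For reference, a single timing request against the server looks like the following, a minimal sketch assuming vLLM’s default OpenAI-compatible endpoint on localhost:8000 (the prompt is illustrative):
# Send one chat completion and time it end-to-end;
# assumes a server launched with one of the commands above.
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "NousResearch/Hermes-4.3-36B",
        "messages": [{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
        "max_tokens": 512
      }' > /dev/null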
Results
Cross-Comparison
| Metric | BF16 | FP8 | nvfp4 |
|---|---|---|---|
| Generation (TG) | 17-19 tok/s | 18-20 tok/s | 31-33 tok/s |
| Prefill (PP) | 300-500 tok/s | 1000+ tok/s | 280-500 tok/s |
| TTFT | Slow | Improved | Good |
| VRAM Usage | 90%+ | Medium | ~22GB |
| KV Cache Usage | 6-8% (short text) | Improved | 1-2% (ample headroom) |
| Quality/Stability | Most stable | Good | Some instability in edge cases |
Key Observations
nvfp4 generation is ~1.7-2x faster than BF16. 31-33 tok/s eliminates the “waiting” feeling in conversation. At ~22GB VRAM, multi-model concurrent operation is feasible on a 96GB GPU.
FP8 Prefill is exceptionally fast. 1000+ tok/s Prefill significantly reduces TTFT when sending long System Prompts or RAG contexts. Generation speed is roughly equal to BF16.
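As rough arithmetic on the measured numbers: an 8,000-token RAG prompt prefills in about 8 s at 1,000 tok/s, versus roughly 16-27 s at 300-500 tok/s, so prefill dominates TTFT for long contexts.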
BF16 is heavy but most reliable. It uses over 90% of VRAM, limiting context length, but showed the highest consistency in long-form text and precise code modifications.
Analysis
The “Fast = Smart” Illusion
nvfp4’s snappy responses create a subjective impression of “the model got smarter.” In reality, quantization-induced quality degradation surfaces in long-form coherence and complex reasoning. This illusion is easy to miss with subjective evaluation alone.
Evaluate with objective metrics like “first-pass test success rate” rather than perceived responsiveness.
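As a minimal sketch of such a check, assuming generated patches land in a patches/ directory and the project’s tests run under pytest (both names are illustrative):
# Count how many generated patches pass the test suite on the first try.
pass=0; total=0
for patch in patches/*.diff; do
  total=$((total + 1))
  git apply "$patch" && pytest -q && pass=$((pass + 1))
  git checkout -- .   # revert tracked changes before the next patch
done
echo "first-pass success: $pass / $total"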
Context-Switching Design Philosophy
The most rational approach was switching by use case:
Exploratory development (nvfp4 advantage):
- “Just try it” phase
- Rapid iteration on code fragments
- MCP + context7 conversational workflows
- Short latency maintains development rhythm
Destructive changes (BF16/FP8 advantage):
- Repository-wide refactoring requiring consistency
- Critical logic modifications
- Phases where first-pass test success rate determines efficiency
- Final review stages
Tool Use Capability
Hermes-4.3-36B showed limitations in deep reasoning but was relatively stable in tool use (MCP, Function Calling). Argument specification and task chaining worked reliably, making it practical in workflows that combine LLM with external tools (static analysis, etc.).
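For tool-use testing, vLLM ships a Hermes-format tool-call parser; enabling it looks like this (flags per vLLM’s OpenAI-compatible server options, shown here for the nvfp4 checkpoint):
vllm serve NousResearch/Hermes-4.3-36B-nvfp4 \
  --max-num-seqs 1 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
With this enabled, tool definitions passed via the standard tools field of /v1/chat/completions are parsed into structured tool_calls in the response.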
Lessons Learned
Rather than “nvfp4 is fast but sloppy” vs “BF16 is slow but solid,” switching by development phase was the practical answer. Using nvfp4 as default and switching to BF16 only for final modifications produced the least friction.
FP8 fills the “BF16 is too heavy but 4-bit is scary” gap perfectly. Its Prefill speed (1000+ tok/s) is particularly valuable for RAG and MCP workflows with long contexts.
Reproduction Steps
1. Download Models
# BF16/FP8
huggingface-cli download NousResearch/Hermes-4.3-36B
# nvfp4
huggingface-cli download NousResearch/Hermes-4.3-36B-nvfp4
2. Launch vLLM Server
See the commands in the “Methodology” section. --max-num-seqs 1 is for single-user chat; increase it for batch processing.
3. Measure
Extract Avg generation throughput and Avg prompt throughput from vLLM logs. Confirm stable values across multiple requests.
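A quick way to pull those values from the server log, assuming the default periodic stats lines and an illustrative vllm.log file:
# Print the periodic throughput lines vLLM emits (roughly every 10 s under load).
grep -E "Avg (prompt|generation) throughput" vllm.log | tail -n 20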
Technical Notes
FP8 Quantization in vLLM
vLLM’s --quantization fp8 converts BF16 models to FP8 at runtime, so no pre-quantized model is needed. Hardware-accelerated FP8 requires compute capability 8.9 (Ada Lovelace) or newer; the RTX PRO 6000 Blackwell used here reports compute capability 12.0.
nvfp4 VRAM Estimate
A 36B model in nvfp4: ~22GB. Runnable on 24GB GPUs, but 32GB+ is recommended for KV cache headroom. On 96GB, 2-3 nvfp4 models can be loaded simultaneously.
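The estimate follows from the weights alone: 36B parameters × 4 bits ≈ 18GB, and block scale factors plus runtime overhead push the total toward ~22GB.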
Quantization Selection Flowchart
- VRAM insufficient -> nvfp4 (only option)
- VRAM available + chat focus -> nvfp4 (speed priority)
- VRAM available + code editing -> FP8 (balanced)
- Final review / precision required -> BF16 (quality priority)
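Scripted, the same decision tree might look like this sketch; the phase names are illustrative and the launch flags mirror the Methodology section:
#!/usr/bin/env bash
# Pick launch flags by development phase (see the flowchart above).
case "$1" in
  explore)  # chat / rapid iteration -> nvfp4
    vllm serve NousResearch/Hermes-4.3-36B-nvfp4 \
      --max-num-seqs 1 --max-model-len 32768 ;;
  edit)     # code editing -> FP8
    vllm serve NousResearch/Hermes-4.3-36B \
      --dtype bfloat16 --quantization fp8 \
      --max-num-seqs 1 --max-model-len 65536 ;;
  review)   # final review / precision -> BF16
    vllm serve NousResearch/Hermes-4.3-36B \
      --dtype bfloat16 \
      --max-num-seqs 1 --max-model-len 65536 ;;
  *) echo "usage: $0 {explore|edit|review}" >&2; exit 1 ;;
esac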

