Background

For local LLM workflows involving chat, code generation, and MCP tool integration, choosing a quantization level is a recurring decision. NousResearch's Hermes-4.3-36B, a 36B-class model strong in tool use (Function Calling), was evaluated as a candidate for serving with vLLM.

With an RTX PRO 6000 Blackwell (96GB VRAM), BF16 (no quantization) runs but consumes 90%+ VRAM, leaving little room for context. nvfp4 (4-bit) needs only ~22GB. The question: what do you lose in exchange for speed?

Objective

  1. Quantify generation speed, TTFT, and VRAM consumption across BF16, FP8, and nvfp4
  2. Assess how much the “fast = smart” illusion occurs in practice
  3. Establish per-use-case quantization selection criteria

Test Environment

  Item      Specification
  GPU       NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
  CPU       AMD EPYC 9175F
  Memory    DDR5-6400 768GB
  Runtime   vLLM 0.14.0rc1
  Model     NousResearch/Hermes-4.3-36B

Methodology

BF16 (No Quantization)

  vllm serve NousResearch/Hermes-4.3-36B \
    --dtype bfloat16 \
    --max-num-seqs 1 \
    --max-model-len 65536

FP8

  vllm serve NousResearch/Hermes-4.3-36B \
    --dtype bfloat16 \
    --quantization fp8 \
    --max-num-seqs 1 \
    --max-model-len 65536

nvfp4 (4-bit)

  vllm serve NousResearch/Hermes-4.3-36B-nvfp4 \
    --max-num-seqs 1 \
    --max-model-len 32768

Same chat and code generation tasks were run on each configuration, with throughput measured from vLLM logs.

Results

Cross-Comparison

  Metric              BF16                FP8           nvfp4
  Generation (TG)     17-19 tok/s         18-20 tok/s   31-33 tok/s
  Prefill (PP)        300-500 tok/s       1000+ tok/s   280-500 tok/s
  TTFT                Slow                Improved      Good
  VRAM Usage          90%+                Medium        ~22GB
  KV Cache Usage      6-8% (short text)   Improved      1-2% (ample headroom)
  Quality/Stability   Most stable         Good          Some instability in edge cases

Key Observations

nvfp4 generation is ~1.7-2x faster than BF16. 31-33 tok/s eliminates the “waiting” feeling in conversation. At ~22GB VRAM, multi-model concurrent operation is feasible on a 96GB GPU.

FP8 Prefill is exceptionally fast. 1000+ tok/s Prefill significantly reduces TTFT when sending long System Prompts or RAG contexts. Generation speed is roughly equal to BF16.
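The TTFT gap follows directly from prefill throughput. A back-of-envelope sketch, where the 8,000-token prompt is a hypothetical RAG context size and the throughput figures come from the table above:

```python
# Back-of-envelope TTFT estimate: prompt_tokens / prefill throughput.
# Ignores fixed scheduling overhead, so treat the numbers as lower bounds.
def ttft_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Approximate time-to-first-token for a given prefill rate."""
    return prompt_tokens / prefill_tps

bf16_ttft = ttft_seconds(8000, 400)   # BF16-class prefill: 20.0 s
fp8_ttft = ttft_seconds(8000, 1000)   # FP8-class prefill:   8.0 s
print(f"BF16: {bf16_ttft:.1f}s  FP8: {fp8_ttft:.1f}s")
```

For long-context workloads, that difference is the wait before any token appears, which is why FP8's prefill advantage matters more than its generation speed.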

BF16 is heavy but most reliable. Uses 90%+ VRAM limiting context length, but showed the highest consistency in long-form text and precise code modifications.

Analysis

The “Fast = Smart” Illusion

nvfp4’s snappy responses create a subjective impression of “the model got smarter.” In reality, quantization-induced quality degradation surfaces in long-form coherence and complex reasoning. This illusion is easy to miss with subjective evaluation alone.

Evaluate with objective metrics like “first-pass test success rate” rather than perceived responsiveness.
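One way to make that metric concrete is to track first-pass success over a fixed task set. A minimal sketch, where the function and the run data are illustrative rather than measured results:

```python
# First-pass success rate: the fraction of tasks whose generated code
# passed its test suite on the first attempt, with no retries.
def first_pass_rate(results: list[bool]) -> float:
    """results[i] is True iff task i passed its tests on the first try."""
    return sum(results) / len(results) if results else 0.0

# Hypothetical runs of the same task set on two quantization levels:
nvfp4_runs = [True, True, False, True, False]
bf16_runs = [True, True, True, True, False]
print(first_pass_rate(nvfp4_runs), first_pass_rate(bf16_runs))  # 0.6 0.8
```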

Context-Switching Design Philosophy

The most rational approach was switching by use case:

Exploratory development (nvfp4 advantage):

  • “Just try it” phase
  • Rapid iteration on code fragments
  • MCP + context7 conversational workflows
  • Short latency maintains development rhythm

Destructive changes (BF16/FP8 advantage):

  • Repository-wide refactoring requiring consistency
  • Critical logic modifications
  • Phases where first-pass test success rate determines efficiency
  • Final review stages

Tool Use Capability

Hermes-4.3-36B showed limitations in deep reasoning but was relatively stable in tool use (MCP, Function Calling). Argument specification and task chaining worked reliably, making it practical in workflows that combine LLM with external tools (static analysis, etc.).
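As an illustration, a tool-use request to the vLLM server's OpenAI-compatible endpoint looks like the sketch below. The `run_static_analysis` tool and its schema are hypothetical examples, not part of Hermes or vLLM:

```python
import json

# OpenAI-compatible chat request with a tool definition, as accepted by
# vLLM's /v1/chat/completions endpoint. The tool is a made-up example.
payload = {
    "model": "NousResearch/Hermes-4.3-36B",
    "messages": [{"role": "user", "content": "Lint src/main.py"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "run_static_analysis",
            "description": "Run a linter on a file and return findings",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
}
# POST this JSON to http://localhost:8000/v1/chat/completions and check
# the response's tool_calls for the argument specification.
print(json.dumps(payload)[:60])
```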

Lessons Learned

Rather than “nvfp4 is fast but sloppy” vs “BF16 is slow but solid,” switching by development phase was the practical answer. Using nvfp4 as default and switching to BF16 only for final modifications produced the least friction.

FP8 fills the “BF16 is too heavy but 4-bit is scary” gap perfectly. Its Prefill speed (1000+ tok/s) is particularly valuable for RAG and MCP workflows with long contexts.

Reproduction Steps

1. Download Models

  # BF16/FP8
  huggingface-cli download NousResearch/Hermes-4.3-36B

  # nvfp4
  huggingface-cli download NousResearch/Hermes-4.3-36B-nvfp4

2. Launch vLLM Server

See commands in “Methodology” section. --max-num-seqs 1 is for single-user chat. Increase for batch processing.

3. Measure

Extract Avg generation throughput and Avg prompt throughput from vLLM logs. Confirm stable values across multiple requests.
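A small parser for those log lines might look like the sketch below; the exact log format varies between vLLM versions, so treat the regex as a template to adapt:

```python
import re

# Extract (prompt_tps, generation_tps) from a vLLM metrics log line.
# The format below matches recent vLLM releases but is not guaranteed stable.
LOG_RE = re.compile(
    r"Avg prompt throughput: ([\d.]+) tokens/s, "
    r"Avg generation throughput: ([\d.]+) tokens/s"
)

def parse_throughput(line: str):
    """Return (prompt_tps, generation_tps), or None if the line doesn't match."""
    m = LOG_RE.search(line)
    return (float(m.group(1)), float(m.group(2))) if m else None

sample = ("INFO ... Avg prompt throughput: 1023.4 tokens/s, "
          "Avg generation throughput: 31.5 tokens/s, Running: 1 reqs")
print(parse_throughput(sample))  # (1023.4, 31.5)
```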

Technical Notes

FP8 Quantization in vLLM

vLLM’s --quantization fp8 converts BF16 weights to FP8 at load time, so no pre-quantized checkpoint is needed. Hardware FP8 kernels require compute capability 8.9 or later (Ada and newer); the RTX PRO 6000 Blackwell used here reports compute capability 12.0.

nvfp4 VRAM Estimate

A 36B model in nvfp4 takes ~22GB. It is runnable on 24GB GPUs, but 32GB+ is recommended for KV cache headroom. On 96GB, 2-3 nvfp4 models can be loaded simultaneously.
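The ~22GB figure is consistent with simple arithmetic: 4 bits per weight plus roughly 0.5 bits of block-scale overhead (assuming one FP8 scale per 16-value block), before KV cache and runtime overhead are added:

```python
# Weight-memory estimate for an N-billion-parameter model in NVFP4.
# 4-bit weights + ~0.5 bits/weight of block scales = 4.5 bits per weight.
def nvfp4_weight_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Gigabytes of weight storage for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(f"{nvfp4_weight_gb(36):.1f} GB")  # ~20.2 GB for weights alone
```

The remaining gap to ~22GB is plausibly scales, embeddings kept at higher precision, and allocator overhead; the exact split is an assumption, not measured.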

Quantization Selection Flowchart

  1. VRAM insufficient -> nvfp4 (only option)
  2. VRAM available + chat focus -> nvfp4 (speed priority)
  3. VRAM available + code editing -> FP8 (balanced)
  4. Final review / precision required -> BF16 (quality priority)
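The four branches above can be expressed as a small selection helper; the function name and use-case labels are hypothetical:

```python
# Quantization selection flowchart as code. Branches mirror the list above.
def pick_quantization(vram_sufficient: bool, use_case: str) -> str:
    """use_case: one of 'chat', 'code_editing', 'final_review'."""
    if not vram_sufficient:
        return "nvfp4"   # 1. VRAM insufficient: only option
    if use_case == "chat":
        return "nvfp4"   # 2. speed priority
    if use_case == "code_editing":
        return "fp8"     # 3. balanced
    return "bf16"        # 4. final review / quality priority

print(pick_quantization(True, "chat"))          # nvfp4
print(pick_quantization(True, "final_review"))  # bf16
```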