Summary

MiniMax-2.5 (a 229B-parameter MoE with 8 of 256 experts active) runs stably on a single 96GB-VRAM GPU via Expert Offload (-ot exps=CPU). Decode speed holds at 34-37 tok/s with IQ5_K, and dropping to IQ3_S moves more weights onto the GPU for additional speed. On top of this setup, two one-shot web generation tests were validated: a React landing page (IQ4_NL) and a dental clinic static site (IQ3_S).

Test Environment

Item       Specification
CPU        AMD EPYC 9175F (Zen 5, 16C/32T, 512MB L3)
GPU        NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM)
Memory     DDR5-6400 768GB (12-channel)
OS         Ubuntu 24.04 LTS
Runtime    ik_llama.cpp (build 4192)
Container  Podman rootless

Model Specifications

Item              Value
Architecture      MiniMax-M2 (MoE)
Size              229B-A10B (229B total, 10B active)
Layers            62
Experts           256 (8 active)
Training context  196,608
rope freq_base    5,000,000

Part 1: Expert Offload Benchmark (IQ5_K)

Execution Command

  podman run --rm -it \
  --device nvidia.com/gpu=all \
  -p 8081:8080 \
  --shm-size 16g \
  --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z \
  $IMG \
  --host 0.0.0.0 --port 8080 \
  -m "$MODEL" \
  --no-mmap --jinja \
  -c 65536 \
  --threads 13 --threads-batch 25 \
  -b 2048 -ub 2048 \
  -ngl 99 \
  -ot exps=CPU \
  -ctk f16 -ctv f16 \
  --warmup-batch \
  -fa on
  

Key parameters:

  • -ot exps=CPU: forces expert weights (the bulk of the 157GB model) into CPU memory
  • -ngl 99: the log reports 63/63 layers offloaded to GPU, but the -ot rule pulls the expert tensors back to CPU
  • --no-mmap: loads the full 157GB into physical RAM up front, eliminating page faults during inference
  • --warmup-batch: runs a warmup batch at startup
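The $MO, $IMG, and $MODEL variables in the command above are environment-specific. A hypothetical setup for illustration (paths and image name are placeholders, not the values used in this run):

```shell
# Placeholder values -- adjust to your own layout
export MO=/mnt/data/hf/hub                       # host directory containing the GGUF files
export IMG=localhost/ik-llama:latest             # ik_llama.cpp server container image
export MODEL=/models/MiniMax-M2.5-IQ5_K.gguf     # model path as seen inside the container
```

Note that $MODEL must be the in-container path, since $MO is mounted at /models.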

Memory Layout

Region                        Size                  Location
CPU buffer (expert weights)   157,356 MiB (~154GB)  CPU RAM
KV cache                      15,872 MiB (~15.5GB)  CUDA0
Compute buffer                1,990 MiB             CUDA0
GPU buffer (attention, etc.)  3,579 MiB             CUDA0

Actual GPU VRAM usage is ~21GB. The remaining 75GB is available for KV cache expansion.
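The three CUDA0 buffers in the table account for the stated usage; a quick arithmetic check, using the values from the table above:

```shell
# Sum of the CUDA0 buffers: KV cache + compute + attention weights (MiB)
vram_mib=$((15872 + 1990 + 3579))
echo "${vram_mib} MiB"                                      # 21441 MiB
awk -v m="$vram_mib" 'BEGIN {printf "%.1f GiB\n", m/1024}'  # 20.9 GiB
```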

Results: 8 Consecutive Runs (65K Context)

Run  Prompt(tok)  PP(tok/s)  Gen(tok)  TG(tok/s)  Total(s)
1    753          214.07     215       35.37      9.6
2    386          170.36     196       35.23      7.8
3    297          161.38     240       35.21      8.7
4    341          166.12     783       34.57      24.7
5    1,264        205.46     734       34.53      27.4
6    942          215.21     921       34.30      31.2
7    938          216.23     157       34.31      8.9
8    1,075        176.32     1,351     33.76      46.1

Metric  PP(tok/s)  TG(tok/s)
Mean    190.64     34.66
Median  190.89     34.55
Min     161.38     33.76
Max     216.23     35.37
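The summary statistics can be reproduced from the per-run TG column; a minimal recomputation with awk:

```shell
# Mean and median of the eight TG values from the runs above
# (the median formula assumes an even number of samples)
printf '%s\n' 35.37 35.23 35.21 34.57 34.53 34.30 34.31 33.76 \
  | sort -n \
  | awk '{v[NR]=$1; s+=$1}
         END {printf "mean %.2f, median %.2f\n", s/NR, (v[NR/2]+v[NR/2+1])/2}'
# prints: mean 34.66, median 34.55
```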

Extended Session (131K Context)

#  PP(tok)  TG(tok)  Ctx Used  PP(tok/s)  TG(tok/s)
1  3,227    72       4,820     268.08     37.69
4  2,708    512      8,223     289.89     35.96
8  1,965    192      11,172    313.60     35.39

TG speed is extremely stable at 33.7-37.7 tok/s. No sharp degradation as context accumulates.

Analysis: Expert Offload Practicality

  • Speed stability: ~10% TG variation across 8 runs. PCIe bandwidth is the bottleneck, but it rate-limits consistently rather than erratically
  • “All layers GPU offloaded” trap: the log reports “offloaded 63/63 layers to GPU”, but expert weights actually reside on CPU due to -ot — what the log displays and the actual placement diverge
  • Prompt cache: 6,029 tokens → 1,460 MiB of state; 23,104 tokens → 5,596 MiB. TTFT reduction confirmed on repeated runs
  • Tokenizer warning: “special_eos_id is not in special_eog_ids” — if left unaddressed, end-of-generation (EOG/EOS) detection can become unreliable
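The two prompt-cache data points imply the same per-token state size, i.e. the saved state grows linearly with the number of cached tokens:

```shell
# MiB of saved state per cached token, from the two measurements above
awk 'BEGIN {printf "%.3f MiB/tok\n", 1460/6029}'    # 0.242 MiB/tok
awk 'BEGIN {printf "%.3f MiB/tok\n", 5596/23104}'   # 0.242 MiB/tok
```

Both come out to ~0.24 MiB per token, so the cache footprint for a given prompt length is easy to predict.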

One-Prompt Generation Live Demonstration

To verify the output quality of one-prompt generation, we recorded MiniMax-2.5 generating a website in real time. By observing the inference server output directly, you can confirm the actual generation quality.

This video demonstrates:

  • One-shot code generation process from a specification document (AGENTS.md)
  • Token-by-token generation phase output speed
  • Structure and quality of the generated code

Part 2: React LP One-Shot Generation (IQ4_NL)

Configuration

  • Quantization: IQ4_NL (lighter than IQ5_K, quality nearly maintained)
  • Context: 262,144 (256K)
  • Input: AGENTS.md (detailed site specification for loFT LLC)
  • Stack: Vite + React + TypeScript + Tailwind CSS v4

Results

MiniMax-2.5 generated a complete loFT LLC corporate site from the AGENTS.md specification in one shot.

Generated component structure (Atomic Design):

  • Atoms: Button, Heading, Text, Icon, Input, TextArea, Badge
  • Molecules: ServiceCard, CaseStudyCard, TestimonialCard, NavItem
  • Organisms: NavBar, HeroSection, ServiceGrid, ProcessTimeline, ContactForm, Footer

Implemented features:

  1. Hero section (animated background with function curves)
  2. Contact form (react-hook-form + zod validation)
  3. SEO optimization (JSON-LD structured data, dynamic meta tags, favicon generation)
  4. SPA routing (react-router-dom)

Notable: autonomous Tailwind v4 adaptation. Tailwind CSS v4 deprecated tailwind.config.js in favor of CSS-first configuration. MiniMax-2.5 noticed this from the npx tailwindcss init failure logs and switched to @import "tailwindcss"; in the stylesheet.

TypeScript error handling: The model read build error output, attempted to identify and fix type import mismatches and unused variables, ultimately reaching “Build Succeeded.”

Part 3: Dental Clinic Static Site (IQ3_S GPU Full Load)

Design Intent: Why IQ3_S

Dropping to IQ3_S allows placing more Expert weights in GPU VRAM, reducing latency from -ot exps=CPU. Precision is partially traded, but the speed gain is net-positive for structured code generation tasks.
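One way to express this with -ot is to pin the expert tensors of the first few layers to the GPU ahead of the catch-all CPU rule. This is a hypothetical sketch, not the exact flags used in this run; it assumes the llama.cpp convention that override patterns are applied in order, with earlier rules taking precedence:

```
# Hypothetical: keep layers 0-9 expert tensors in VRAM, offload the rest
-ot "blk\.[0-9]\.ffn_.*_exps.*=CUDA0" \
-ot exps=CPU
```

The number of pinned layers can then be raised until VRAM is nearly full.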

Requirements

  • 6-page static HTML site (index, services, doctors, info, visit, access)
  • No build step, Tailwind CSS (CDN), Alpine.js (CDN)

Results: 6 Pages in ~18 Minutes

Page           Key Content                                   Alpine.js Interactions
index.html     Hero, 8 service cards, doctor previews, news  Mobile nav, language switch UI
services.html  14 treatment cards, FAQ section               Category filtering, accordion
doctors.html   6 doctor profiles, specialties                Expandable details
info.html      Fee table, insurance, payment methods         Estimate modal
visit.html     First visit flow, reservations                Validated form
access.html    Map, transit, parking                         Direction tab switching

  • Total time: 18 minutes 2 seconds (design, implementation, diagnostics)
  • Diagnostics: Zero errors and warnings
  • Quality: Even at IQ3_S, accessibility (ESC key close, focus management) and semantic HTML structure were maintained

Generated Site Screenshots

Dental clinic site: Homepage hero section and service cards
Homepage — Hero + 8 Service Cards
Dental clinic site: Homepage footer with clinic hours, insurance, and first visit CTA
Homepage Bottom — Clinic Hours, First Visit CTA, Footer
Dental clinic site: Fees and insurance page
Info Page — Fee Schedule + Insurance & Payment Methods
Dental clinic site: Services page with category filter and FAQ accordion
Services Page — Category Filter + FAQ Accordion

Quantization-Level Operating Guidelines

Quantization  Model Size       TG Speed          Use Case
IQ5_K         157.7 GiB        34-37 tok/s       Benchmarking, quality-critical generation
IQ4_NL        ~120 GiB (est.)  Improved          Long-context work, React LP generation
IQ3_S         ~100 GiB (est.)  Further improved  High-volume structured code, speed-priority

“Running the biggest model at max precision” is not always optimal. Adjusting quantization to task characteristics and prioritizing GPU-resident weights is a practical operational strategy.
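A rough bits-per-weight estimate from the IQ5_K file size is consistent with the table (assuming 229e9 total parameters and GiB = 2^30 bytes):

```shell
# 157.7 GiB on disk spread over 229e9 parameters -> average bits per weight
awk 'BEGIN {printf "%.2f bpw\n", 157.7 * 1024^3 * 8 / 229e9}'   # 5.92 bpw
```

The average lands slightly above IQ5_K's nominal bit-width, plausibly because mixed quantizations typically keep some tensors (embeddings, attention) at higher precision.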

Takeaways

A 229B model running locally at 37 tok/s would have been hard to imagine a few years ago. Expert Offload — “if it doesn’t fit on GPU, move it to CPU” — is a rational approach that exploits MoE’s structural property of independent experts. Both the React LP and the dental clinic site produced production-grade code straight from their specifications, and the 229B-class model demonstrably maintained cross-file consistency on complex tasks.

Reproduction

Model Download

  huggingface-cli download <quantizer>/MiniMax-M2.5-GGUF \
  --include "MiniMax-M2.5-IQ5_K.gguf" \
  --local-dir /mnt/data/hf/hub/models--MiniMax-2.5-GGUF
  

Benchmark

Refer to the “Part 1: Execution Command” section above. Requires ik_llama.cpp with Expert Offload support.

Measurement

  curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"minimax","messages":[{"role":"user","content":"Explain MoE architecture"}],"max_tokens":512}'
  

Technical Notes

--no-mmap Necessity

Reading the 157GB of weights via mmap causes intermittent page faults that destabilize latency. --no-mmap loads everything into physical RAM up front, eliminating page faults during inference; the trade-off is a startup time that stretches to several minutes.
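The resident weight buffer fits comfortably within the 768GB of system RAM; converting the size reported in the memory-layout table:

```shell
# CPU expert buffer from the logs, MiB -> GiB
awk 'BEGIN {printf "%.1f GiB\n", 157356/1024}'   # 153.7 GiB
```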

Expert Count vs Decode Speed

MiniMax-2.5 selects 8 of 256 experts per token. Compared to GLM-4.7-Flash (4 of 64), expert-selection compute is larger, but each individual expert is smaller (FFN dim 1,536). Expert count alone does not predict decode speed.