loFT LLC

      LLM Research

      Large language model benchmarks, CPU/GPU inference validation, and optimization research.

      These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.
      English translations are produced with AI assistance.

      Django 5 Travel Booking Site Generation Test with Qwen3.5-122B-A10B Local Inference

      Testing whether a locally-running Qwen3.5-122B-A10B (Q5_K_M) can generate a full-stack Django 5 web …


      Why EPYC 9175F's 512MB L3 Cache Accelerates MoE Inference: Hypothesis Validation with a 1T Model

      Running Kimi-K2.5 (1T MoE) CPU-only on AMD EPYC 9175F to validate the hypothesis that massive L3 …


      Why Quantization Choice Changes Everything for Hermes-4.3-36B: BF16/FP8/nvfp4 Measured Comparison

      Comparing Hermes-4.3-36B across BF16, FP8, and nvfp4 on Blackwell GPU. nvfp4 runs 2x faster than …


      MiniMax-2.5 229B MoE with IQ5K Quantization on Blackwell GPU: 35 tok/s Generation, 65K Context Validation

      Benchmarking MiniMax-2.5 229B MoE (IQ5K) on NVIDIA RTX PRO 6000 Blackwell. Prompt evaluation …


      The Reality of 40B Dense Models: What Running IQuest-Coder-V1-40B on CPU/GPU/Aider Actually Showed

      IQuest-Coder-V1-40B-Instruct (Dense 40B) tested across CPU Q5_K_M, GPU nvfp4, and Aider whole-edit. …


      MiniMax-2.5 (229B MoE) Expert Offload and Web Generation: IQ5_K to IQ3_S

      Complete record of running the 229B MoE model MiniMax-2.5 on EPYC 9175F + RTX PRO 6000. Expert …


      Qwen3.5-397B IQ4_NL Measured: 22.5tok/s Average from 28 Runs, Hybrid Offload Config and 400B-Class MoE Daily Viability

      Qwen3.5-397B-A17B (397B total / 17B active MoE) deployed with IQ4_NL quantization on EPYC 9175F + …


      Llama-4-Scout-17B-16E Measured: CPU Q6_K 17tok/s vs GPU nvfp4 60tok/s, Cache Strategy and 100K Context Boundary

      Llama-4-Scout (17B active / 16-expert MoE) benchmarked on EPYC 9175F CPU Q6_K inference and RTX PRO …


      1T MoE Kimi-K2.5 CPU Inference: Thread Optimization Through Long Context Operations

      Complete CPU inference benchmark of Kimi-K2.5 (1.03T MoE, Q4_K_S/Q4_K_M) on EPYC 9175F. Why th=13 is …


      Llama-4-Maverick-17B-128E CPU Inference: Q4_K_M vs Q8_0 Speed-Quality Trade-off Measured

      Llama-4-Maverick (17B active / 128-expert MoE) CPU inference on EPYC 9175F, comparing Q4_K_M and …


      Qwen3-Coder-Next 80B in Three Modes: BF16 CPU / IQ4_NL Hybrid / nvfp4 GPU Measured

      Qwen3-Coder-Next (~80B MoE) benchmarked across BF16 CPU inference (7.8 tok/s), IQ4_NL Hybrid GPU …


      GLM-4.7-Flash IQ5_K Benchmark: CPU vs Hybrid vs Full GPU Performance Comparison

      Benchmarking GLM-4.7-Flash (IQ5_K GGUF) across CPU-only, MoE Expert Offload (Hybrid), and Full GPU …


      Why DeepSeek-V3.2 Appears Slower Than Kimi-K2.5: Prompt Cache Mismatches and TG Bottleneck Analysis

      Analyzing why DeepSeek-V3.2 decode speed plateaus at 14-15 tok/s in llama.cpp, traced to prompt …


      © 2017-2026 loFT LLC