DeepSeek V4 Paper Details FP4 Quantization Breakthrough for Trillion-Parameter MoE Architecture
DeepSeek has released the full technical paper for its V4 model, expanding an earlier 58-page preview with substantial additional technical depth. The document outlines how the team applied FP4 quantization-aware training (QAT) directly in the late stages of training, a departure from conventional pipelines that reserve quantization for post-training. The implementation quantizes Mixture-of-Experts (MoE) expert weights, the largest consumer of GPU memory, to FP4, while the QK path in the CSA indexer operates on FP4 activations. The result: a reported 2x speedup on QK selector operations while preserving 99.7% recall, with inference running natively on FP4 weights.
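The paper's exact quantizer is not reproduced above, so here is a minimal sketch of what FP4 QAT on expert weights can look like, assuming a symmetric E2M1-style FP4 grid, a per-tensor scale, and a straight-through estimator; all names here are illustrative, not DeepSeek's implementation.

```python
import torch

# Representable magnitudes of the E2M1 FP4 format, a common 4-bit float grid.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

class FakeQuantFP4(torch.autograd.Function):
    """Fake-quantize weights to FP4 in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w):
        grid = FP4_GRID.to(w.device, w.dtype)
        # Per-tensor scale so the largest magnitude maps to the grid maximum (6.0).
        scale = w.abs().max().clamp(min=1e-8) / grid[-1]
        # Snap each scaled magnitude to the nearest representable FP4 value.
        idx = torch.argmin((w.div(scale).abs().unsqueeze(-1) - grid).abs(), dim=-1)
        return grid[idx] * w.sign() * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: treat the quantizer as identity for gradients.
        return grad_out

def expert_forward(x, w):
    # The expert weight sees FP4 values in the forward pass but receives
    # full-precision gradient updates, letting training adapt to the FP4 error.
    return x @ FakeQuantFP4.apply(w).t()
```

The point of doing this during training rather than after it is that the optimizer sees the quantization error at every step and steers the weights toward values that survive the FP4 snap, which is what allows inference to run natively on the quantized weights.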
The efficiency gains are significant. According to internal benchmarks, the V4-Pro variant achieves 27% of baseline FLOPs and 10% of baseline KV cache requirements for 1M context windows. The V4-Flash variant goes further, reaching just 10% of baseline FLOPs and 7% of baseline KV cache. These reductions address the core scaling bottleneck in large language model deployment, where memory bandwidth and cache size traditionally constrain inference throughput.
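For a sense of what a 90-93% KV-cache reduction means at a 1M-token context, here is a back-of-envelope sketch; the layer count, head count, and head dimension are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # K and V each store one vector of size n_kv_heads * head_dim
    # per token per layer, hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dense baseline: 61 layers, 8 KV heads of dim 128, FP16 cache.
baseline = kv_cache_bytes(61, 8, 128, seq_len=1_000_000, bytes_per_elem=2)
print(f"baseline    : {baseline / 2**30:6.1f} GiB per sequence")  # ~232.7 GiB
print(f"10% (Pro)   : {0.10 * baseline / 2**30:6.1f} GiB")        # ~23.3 GiB
print(f" 7% (Flash) : {0.07 * baseline / 2**30:6.1f} GiB")        # ~16.3 GiB
```

Under these assumed shapes, a single 1M-token sequence drops from hundreds of GiB of cache to tens, which is the difference between sharding the cache across many accelerators and fitting it on a handful.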
The paper also documents two mechanisms for maintaining training stability at trillion-parameter scale, where MoE models face an inherent risk of loss spikes and divergence. The first, anticipatory routing, deliberately desynchronizes main-model and router parameter updates: at each training step, feature extraction uses the latest parameters while routing decisions rely on cached older weights. This adjustment decouples routing instability from model weight divergence; a sketch of the idea follows below. The second stability mechanism remains partially obscured in the source material, so the complete training methodology will likely await further analysis from the technical community.
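The paper's exact update schedule is not given here, but a minimal sketch of the staleness idea, routing on a frozen copy of the router while the live copy keeps training, might look like the following; the class, method names, and refresh policy are my own assumptions, not DeepSeek's implementation.

```python
import copy
import torch
import torch.nn as nn

class AnticipatoryRouter(nn.Module):
    """Route with cached (stale) router weights while the live router keeps training."""

    def __init__(self, d_model: int, n_experts: int, refresh_every: int = 1):
        super().__init__()
        self.live = nn.Linear(d_model, n_experts, bias=False)  # updated by the optimizer
        self.stale = copy.deepcopy(self.live)                  # frozen copy used for routing
        self.stale.requires_grad_(False)
        self.refresh_every = refresh_every
        self.step = 0

    def forward(self, h: torch.Tensor, top_k: int = 2):
        # Routing decisions come from cached weights of an earlier step, so this
        # step's gradient update cannot immediately reshuffle expert assignments.
        with torch.no_grad():
            logits = self.stale(h)
            top_vals, top_idx = logits.topk(top_k, dim=-1)
        # The live router still receives a training signal (e.g. a load-balance
        # loss computed on live_logits), keeping it aligned over time.
        live_logits = self.live(h)
        return top_idx, torch.softmax(top_vals, dim=-1), live_logits

    def advance(self):
        # Called after each optimizer step: lag the routing copy behind the live weights.
        self.step += 1
        if self.step % self.refresh_every == 0:
            self.stale.load_state_dict(self.live.state_dict())
```

The design consequence is that a token's expert assignment can never be perturbed by the very gradient step it produced, which is one plausible way to break the feedback loop between routing shifts and loss spikes that the paper describes.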