CoreWeave has reported record-breaking results in the MLPerf Training v6.0 benchmark suite, specifically achieving the fastest training performance for the DeepSeek-V3 671B model. The results were achieved using the same production cloud infrastructure available to customers, demonstrating the performance of large-scale NVIDIA GB300 NVL72 clusters.
DeepSeek-V3 Training on NVIDIA GB300 NVL72 Clusters
CoreWeave utilized the largest GB300 cluster submitted in the v6.0 round to train the DeepSeek-V3 671B model, one of the most computationally demanding models benchmarked. Using 8,192 NVIDIA GB300 NVL72 GPUs across 2,048 nodes, the company reached the target quality in 2.02 minutes.
The company submitted three different GB300 NVL72 configurations to test scaling efficiency. When the cluster was scaled down to 4,096 GPUs across 1,024 nodes, training time increased to 3.09 minutes. At 2,048 GPUs across 512 nodes, the time was 5.54 minutes. CoreWeave was the only submitter in this round to scale a GB300 platform beyond 2,048 GPUs for the DeepSeek-V3 workload.
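The three reported data points are enough for a rough check of scaling efficiency. The sketch below is illustrative arithmetic, not an MLPerf metric. The helper name `scaling_table` and the choice of the 2,048-GPU run as the baseline are assumptions made here for the example.

```python
# Reported DeepSeek-V3 671B results from the article: (GPU count, minutes to target quality).
RESULTS = [(2048, 5.54), (4096, 3.09), (8192, 2.02)]


def scaling_table(results):
    """Compare each run to the smallest one (hypothetical helper, not an MLPerf tool).

    speedup     = baseline time / run time
    efficiency  = speedup / (run GPUs / baseline GPUs); 1.0 would be perfectly linear
    gpu_minutes = GPUs * minutes, a crude proxy for compute consumed
    """
    base_gpus, base_minutes = results[0]
    rows = []
    for gpus, minutes in results:
        speedup = base_minutes / minutes
        ideal_speedup = gpus / base_gpus
        rows.append({
            "gpus": gpus,
            "minutes": minutes,
            "speedup": speedup,
            "efficiency": speedup / ideal_speedup,
            "gpu_minutes": gpus * minutes,
        })
    return rows


if __name__ == "__main__":
    for row in scaling_table(RESULTS):
        print(
            f"{row['gpus']:>5} GPUs  {row['minutes']:.2f} min  "
            f"speedup {row['speedup']:.2f}x  "
            f"efficiency {row['efficiency']:.0%}  "
            f"{row['gpu_minutes']:,.0f} GPU-minutes"
        )
```

On the reported figures, doubling to 4,096 GPUs gives about a 1.79x speedup (roughly 90% efficiency), and quadrupling to 8,192 GPUs gives about 2.74x (roughly 69%). The largest run finishes fastest but consumes the most GPU-minutes.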
Performance Across Llama and GPT-OSS Models
The benchmark results extended to other model architectures and cluster sizes. A 4,096-GPU deployment on NVIDIA GB300 NVL72 reached the Llama-3.1-405B reference quality target in 9.77 minutes. That result was near parity with larger GB200 deployments while using 20% fewer GPUs. The run used the NVIDIA NeMo Framework Release 26.04, NVIDIA Spectrum-X Ethernet running RoCE for the scale-out fabric, and tailored tensor-, pipeline-, and context-parallel sharding.
Additionally, CoreWeave tested a compact 8-node, 64-GPU NVIDIA HGX B200 cluster connected via InfiniBand. This configuration trained GPT-OSS-20B in 26.98 minutes and Llama-3.1-8B in 16.54 minutes. According to the company, these results validate that its engineering optimizations benefit customers at various scales, not only at the frontier scale.
Full-Stack Infrastructure and Orchestration Optimizations
CoreWeave attributed these results to a full-stack optimization strategy involving networking, orchestration, and scheduling. The company highlighted three specific technical components:
- CoreWeave Mission Control: This platform performs continuous health checks across rack-scale systems, including GB300, to validate firmware, network, thermal, and hardware health. This process is intended to reduce "stragglers" and maintain a consistent performance baseline.
- CoreWeave SUNK: This topology-aware scheduler places workloads to maximize locality. It co-locates expert-parallel groups within the same NVL72 domain to minimize inter-rack communication for Mixture-of-Experts (MoE) workloads.
- Rail-Aware Networking: This strategy balances traffic to prevent hotspots within the fabric and keep bandwidth utilization efficient at multi-thousand-GPU scale.
The company stated that the networking fabric, storage architecture, and orchestration used in the MLPerf tests are the same systems currently used by its customers for real-world workloads.
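The locality idea behind topology-aware scheduling can be illustrated with a toy placement routine. This is a minimal sketch under stated assumptions, not SUNK's actual algorithm or API. The names `Domain` and `place_groups`, the first-fit strategy, and the 32-GPU group size are all hypothetical. The only figure taken from the hardware is the 72 GPUs in one NVL72 domain.

```python
from dataclasses import dataclass, field

GPUS_PER_DOMAIN = 72  # GPUs in one NVL72 rack-scale NVLink domain


@dataclass
class Domain:
    """One NVL72 domain with a running count of unallocated GPUs (hypothetical model)."""
    name: str
    free_gpus: int = GPUS_PER_DOMAIN
    groups: list = field(default_factory=list)


def place_groups(group_sizes, domains):
    """Assign every expert-parallel group to exactly one domain.

    Uses first-fit, largest group first. A group is never split across
    domains, so its all-to-all traffic stays inside a single NVLink domain.
    Raises ValueError if some group cannot fit in any single domain.
    """
    placement = {}
    for group_id, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        target = next((d for d in domains if d.free_gpus >= size), None)
        if target is None:
            raise ValueError(f"group {group_id} ({size} GPUs) fits in no single domain")
        target.free_gpus -= size
        target.groups.append(group_id)
        placement[group_id] = target.name
    return placement


if __name__ == "__main__":
    racks = [Domain("rack-0"), Domain("rack-1")]
    # Hypothetical workload: four expert-parallel groups of 32 GPUs each.
    groups = {"ep-0": 32, "ep-1": 32, "ep-2": 32, "ep-3": 32}
    for group_id, rack in place_groups(groups, racks).items():
        print(group_id, "->", rack)
```

A production scheduler would also weigh node health, rail alignment, and fragmentation. The sketch only shows the constraint that a group's GPUs share one domain.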
Key Takeaways
- CoreWeave trained the DeepSeek-V3 671B model in 2.02 minutes using 8,192 NVIDIA GB300 NVL72 GPUs.
- A 4,096-GPU GB300 deployment reached Llama-3.1-405B quality targets in 9.77 minutes, using 20% fewer GPUs than larger GB200 deployments for similar results.
- The company's 64-GPU NVIDIA HGX B200 cluster trained Llama-3.1-8B in 16.54 minutes and GPT-OSS-20B in 26.98 minutes.
TechInsyte's Take
These results signal that the gap between theoretical hardware peak and real-world efficiency increasingly depends on the orchestration layer and topology-aware scheduling. For B2B buyers, the DeepSeek-V3 figures show training time continuing to fall as the cluster grows, though less than proportionally: quadrupling from 2,048 to 8,192 GPUs cut the time from 5.54 to 2.02 minutes, roughly a 2.7x speedup. That suggests compute budgets may be used more efficiently when the cloud provider optimizes for the specific model architecture, and that the largest cluster is not automatically the most cost-efficient one. Executives should monitor whether these "full-stack" gains translate consistently across diverse, non-benchmark workloads.
Source: CoreWeave