
NVIDIA NVL72: Revolutionizing MoE Model Scaling with Expert Parallelism



Joerg Hiller
Oct 20, 2025 15:21

NVIDIA’s NVL72 systems are transforming large-scale MoE model deployment by introducing Wide Expert Parallelism, optimizing performance and reducing costs.

NVIDIA is advancing the deployment of large-scale Mixture of Experts (MoE) models with its NVL72 rack-scale systems, leveraging Wide Expert Parallelism (Wide-EP) to optimize performance and reduce costs, according to NVIDIA’s blog. This approach addresses the challenges of scaling MoE architectures, which achieve greater efficiency than dense models by activating only a subset of their trained parameters for each token.

Expert Parallelism and Its Impact

Expert Parallelism (EP) strategically distributes MoE model experts across multiple GPUs, enhancing computation and memory bandwidth utilization. As models like DeepSeek-R1 expand to hundreds of billions of parameters, EP becomes crucial for maintaining high performance and reducing memory pressure.
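
To make the sparse-activation idea concrete, the following minimal PyTorch sketch (with hypothetical dimensions; it is not NVIDIA’s implementation) shows how a top-k router selects only a few experts per token, so per-token compute scales with the number of activated experts rather than the total expert count:

```python
# Illustrative sketch of MoE top-k routing: each token activates only
# k of E experts, so compute scales with k rather than E.
import torch

def route_tokens(hidden: torch.Tensor, router_w: torch.Tensor, top_k: int = 2):
    """hidden: [tokens, d_model], router_w: [d_model, num_experts]."""
    logits = hidden @ router_w                       # [tokens, num_experts]
    probs = torch.softmax(logits, dim=-1)
    weights, expert_ids = probs.topk(top_k, dim=-1)  # each token picks k experts
    return expert_ids, weights / weights.sum(-1, keepdim=True)

hidden = torch.randn(8, 64)       # 8 tokens, hypothetical d_model = 64
router_w = torch.randn(64, 256)   # hypothetical pool of 256 experts
expert_ids, weights = route_tokens(hidden, router_w)
print(expert_ids.shape)           # torch.Size([8, 2]): only 2 of 256 experts per token
```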

Large-scale EP, which distributes experts across a large number of GPUs, aggregates memory bandwidth and supports larger batch sizes, improving GPU utilization. However, it also introduces new system-level constraints, which NVIDIA’s TensorRT-LLM Wide-EP aims to address through algorithmic optimizations targeting compute and memory bottlenecks. A simple sketch of what distributing experts across GPUs means appears below.
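
As a rough illustration (expert and GPU counts here are assumptions, and TensorRT-LLM’s actual sharding is more sophisticated), placing each expert’s weights on exactly one rank means per-GPU weight memory shrinks as the EP group grows:

```python
# Minimal sketch of partitioning experts across GPU ranks under expert parallelism.
def shard_experts(num_experts: int, ep_size: int) -> dict[int, list[int]]:
    """Assign each expert to one of ep_size GPU ranks, round-robin."""
    placement = {rank: [] for rank in range(ep_size)}
    for expert_id in range(num_experts):
        placement[expert_id % ep_size].append(expert_id)
    return placement

# 256 experts spread over 64 GPUs: each GPU holds only 4 experts' weights,
# so per-GPU memory pressure drops as the EP group widens.
placement = shard_experts(num_experts=256, ep_size=64)
print(len(placement[0]))  # 4
```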

System Design and Architecture

The effectiveness of scaling EP relies heavily on system design and architecture, particularly the interconnect bandwidth and topology, which facilitate efficient memory movement and communication. NVIDIA’s NVL72 systems use optimized software and kernels to manage expert-to-expert traffic, ensuring practical and efficient large-scale EP deployment.

Addressing Communication Overhead

Communication overhead is a significant challenge in large-scale EP, particularly during the inference decode phase when distributed experts must exchange information. NVIDIA’s NVLink technology, with its 130 TB/s aggregate bandwidth, plays a crucial role in mitigating these overheads, making large-scale EP feasible.
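
A back-of-envelope calculation, using assumed token counts and hidden size, gives a feel for why that aggregate bandwidth matters during decode; it ignores topology, kernel overlap, and protocol overheads:

```python
# Rough estimate (assumed numbers) of decode-phase all-to-all traffic:
# each token's activations are dispatched to k experts and combined back.
hidden_bytes = 7168 * 2        # hypothetical d_model x 2 bytes (FP16/BF16)
top_k = 8                      # hypothetical experts activated per token
tokens_per_step = 32 * 1024    # hypothetical aggregate batch during decode

# dispatch + combine roughly doubles the traffic
bytes_per_step = tokens_per_step * top_k * hidden_bytes * 2
nvlink_agg_bw = 130e12         # NVL72 aggregate NVLink bandwidth, bytes/s

print(f"{bytes_per_step / 1e9:.1f} GB moved per decode step")
print(f"{bytes_per_step / nvlink_agg_bw * 1e6:.0f} us at aggregate NVLink bandwidth")
```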

Kernel Optimization and Load Balancing

To optimize expert routing, custom communication kernels are implemented to handle data sizes that vary at runtime, since the number of tokens routed to each expert changes from step to step. NVIDIA’s Expert Parallel Load Balancer (EPLB) further improves utilization by redistributing experts so that no GPU is over- or under-utilized, which is crucial for maintaining efficiency in real-time production systems.
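
The sketch below illustrates the general idea behind such load balancing with a simple greedy heuristic; it is not NVIDIA’s EPLB algorithm, and the load counts are hypothetical:

```python
# Greedy load-balancing sketch: place the hottest experts first, each onto
# the currently least-loaded GPU, so no rank is over- or under-utilized.
import heapq

def rebalance(expert_load: dict[int, int], num_gpus: int) -> dict[int, list[int]]:
    """expert_load: tokens observed per expert over a recent window."""
    heap = [(0, rank) for rank in range(num_gpus)]   # (total load, gpu rank)
    heapq.heapify(heap)
    placement = {rank: [] for rank in range(num_gpus)}
    for expert_id, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        total, rank = heapq.heappop(heap)            # least-loaded GPU so far
        placement[rank].append(expert_id)
        heapq.heappush(heap, (total + load, rank))
    return placement

load = {0: 900, 1: 120, 2: 880, 3: 150, 4: 500, 5: 450}  # hypothetical counts
print(rebalance(load, num_gpus=3))
```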

Implications for AI Inference

Wide-EP on NVIDIA’s NVL72 systems provides a scalable solution for MoE models, reducing weight-loading pressure and improving GroupGEMM efficiency. In testing, large EP configurations demonstrated up to 1.8x higher per-GPU throughput compared to smaller setups, highlighting the potential for significant performance gains.
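
The GroupGEMM point can be visualized with a simplified sketch: tokens routed to the same expert are processed in one larger matrix multiply rather than many small ones, which is what larger effective per-expert batches enable. The shapes below are hypothetical, and this is not TensorRT-LLM’s kernel:

```python
# Simplified grouping of tokens by expert before the expert matmul.
import torch

def expert_forward_grouped(tokens, expert_ids, expert_weights):
    """tokens: [n, d]; expert_ids: [n]; expert_weights: [E, d, d_ff]."""
    out = torch.empty(tokens.size(0), expert_weights.size(-1))
    for e in range(expert_weights.size(0)):
        mask = expert_ids == e
        if mask.any():
            # one larger GEMM per expert instead of one tiny GEMM per token
            out[mask] = tokens[mask] @ expert_weights[e]
    return out

tokens = torch.randn(16, 32)              # 16 tokens, hypothetical d_model = 32
expert_ids = torch.randint(0, 4, (16,))   # hypothetical routing to 4 experts
expert_weights = torch.randn(4, 32, 128)  # 4 experts, hypothetical d_ff = 128
print(expert_forward_grouped(tokens, expert_ids, expert_weights).shape)  # [16, 128]
```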

The advancements in Wide-EP not only improve throughput and latency but also enhance system economics by increasing concurrency and GPU efficiency. This positions NVIDIA’s NVL72 as a pivotal player in the cost-effective deployment of trillion-parameter models, offering developers, researchers, and infrastructure teams new opportunities to optimize AI workloads.

Image source: Shutterstock

Source: https://blockchain.news/news/nvidia-nvl72-revolutionizing-moe-model-scaling
