OpenAI Launches Training Spec to Boost Large-Scale AI

OpenAI has released Multipath Reliable Connection (MRC), an open-source specification for large-scale AI training networks developed alongside AMD, Broadcom, Intel, Microsoft and Nvidia.

The new spec is designed to improve GPU performance and resilience in large training clusters, responding to growing compute demands that require vast numbers of high-performance GPUs.

According to OpenAI, MRC lets engineers train AI models on supercomputers more reliably and more quickly. In particular, OpenAI said it developed MRC to address two core challenges: reducing avoidable network congestion and minimizing the impact of inevitable hardware failures.

The protocol does this by spreading individual data transfers across hundreds of paths, so data can be rerouted within milliseconds in response to congestion or failures.
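The general idea of multipath spreading can be illustrated with a toy sketch. This is not the MRC algorithm, whose details the article does not describe; the `Path` class, the `spray` function and the round-robin policy are all hypothetical and chosen only to show how traffic can shift away from a failed path.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Hypothetical network path; not part of the MRC specification."""
    path_id: int
    healthy: bool = True
    congested: bool = False


def spray(num_packets, paths):
    """Assign packet indices round-robin across the currently usable paths."""
    usable = [p for p in paths if p.healthy and not p.congested]
    if not usable:
        raise RuntimeError("no usable paths")
    assignment = {p.path_id: [] for p in usable}
    for pkt in range(num_packets):
        target = usable[pkt % len(usable)]
        assignment[target.path_id].append(pkt)
    return assignment


# Assumed setup: 4 paths and 8 packets, purely for illustration.
paths = [Path(i) for i in range(4)]
before = spray(8, paths)

# Simulate a link failure on path 1; the same transfer is respread
# over the three remaining paths.
paths[1].healthy = False
after = spray(8, paths)
```

In this sketch every packet is still delivered after the failure, with the load absorbed by the surviving paths. A real protocol would also have to handle packet reordering, retransmission and congestion signals, none of which is modeled here.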

“MRC helps us keep GPUs moving together through congestion, link failures, and maintenance events that would previously have disrupted training,” OpenAI said in a May 5 blog post. “At meaningful scale, that reliability and efficiency is not a nice-to-have; it is part of what makes synchronous frontier model training possible.”

The launch is expected to feed into the company’s Stargate project, a $500B effort to build out AI infrastructure in the U.S.

“To efficiently use compute at the scale of Stargate, we need to drastically reduce complexity in every layer of the stack — including network design,” OpenAI said.

OpenAI has already rolled out MRC across its supercomputers, including systems built with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers.

The vendor also said it has used the technology to train multiple frontier models using hardware from Nvidia and Broadcom.

As of today, MRC is available through the Open Compute Project for community users to deploy and adapt.

OpenAI positioned the release as part of a broader push toward shared infrastructure standards.
