OpenAI Launches Training Spec to Boost Large-Scale AI

OpenAI released Multipath Reliable Connection (MRC), an open-source specification for large-scale AI training networks developed alongside AMD, Broadcom, Intel, Microsoft and Nvidia.

The new spec is designed to improve GPU performance and resilience in large training clusters, responding to growing compute demands that require vast numbers of high-performance GPUs.

MRC is intended to let engineers train AI models on supercomputers more reliably and quickly. In particular, OpenAI said it developed MRC to address two core challenges: reducing avoidable network congestion and minimizing the impact of inevitable hardware failures.

The protocol does this by spreading each individual data transfer across hundreds of paths, so traffic can be rerouted within milliseconds when a path becomes congested or fails.
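The general idea of spreading one transfer over many paths and steering around a failed path can be illustrated with a toy model. The sketch below is not taken from the MRC specification: the `Path` class, the `spray` function and the round-robin policy are all hypothetical, chosen only to show how packets from a single transfer might skip a path once it is marked unhealthy.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    """One hypothetical network path between two endpoints."""
    path_id: int
    healthy: bool = True
    sent: list = field(default_factory=list)  # packet numbers carried by this path


def spray(num_packets, paths, fail_at=None, failed_path=None):
    """Spread the packets of one transfer round-robin over healthy paths.

    If fail_at is given, paths[failed_path] is marked unhealthy just before
    that packet number is sent, and every later packet skips it.
    This is an illustrative policy, not the MRC algorithm.
    """
    cursor = 0
    for packet in range(num_packets):
        if fail_at is not None and packet == fail_at:
            paths[failed_path].healthy = False
        # Try each path at most once, starting from the cursor position.
        for _ in range(len(paths)):
            candidate = paths[cursor % len(paths)]
            cursor += 1
            if candidate.healthy:
                candidate.sent.append(packet)
                break
        else:
            raise RuntimeError("no healthy paths left")
    return paths


if __name__ == "__main__":
    # Four paths, eight packets, path 1 fails just before packet 4 is sent.
    for p in spray(8, [Path(i) for i in range(4)], fail_at=4, failed_path=1):
        print(p.path_id, p.healthy, p.sent)
```

In this toy run the transfer still completes after the failure, because the remaining packets are simply carried by the three surviving paths. A real implementation would also have to handle congestion signals, packet reordering and retransmission, which this sketch leaves out.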

“MRC helps us keep GPUs moving together through congestion, link failures, and maintenance events that would previously have disrupted training,” OpenAI said in a May 5 blog post. “At meaningful scale, that reliability and efficiency is not a nice-to-have; it is part of what makes synchronous frontier model training possible.”

The launch is expected to feed into the company’s Stargate project, a $500B effort to build out AI infrastructure in the U.S.

“To efficiently use compute at the scale of Stargate, we need to drastically reduce complexity in every layer of the stack — including network design,” the vendor said.

OpenAI has already rolled out MRC across its supercomputers, including systems built with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers.

The vendor also said it has used the technology to train multiple frontier models using hardware from Nvidia and Broadcom. 

As of today, MRC is available through the Open Compute Project for community users to deploy and tailor.

OpenAI positioned the release as part of a broader push toward shared infrastructure standards.
