Is it necessary for two separate consortiums to be established for scale-up and scale-out AI systems? In other words, could the Ultra Ethernet Consortium (UEC) and the Ultra Accelerator Link (UAL) Alliance merge into a single organization?
UEC is dedicated to advancing Ethernet technology to meet the needs of scale-out AI/HPC applications. It works to improve Ethernet’s bandwidth, reduce latency, and increase efficiency by developing standards and enhancing hardware, thereby facilitating high-performance communication among thousands of interconnected nodes.
The UAL Alliance aims to provide a set of specifications and standards that enable the industry to develop high-speed interconnect technology for AI accelerators, allowing multiple AI accelerators to work together as a single, tightly coupled unit and boosting the overall performance of the system.
UAL’s GPU-to-GPU interconnect specification was initially influenced by AMD’s Infinity Fabric, which uses a PCIe-like physical and data link layer to achieve ultra-low latency. However, that strict low-latency requirement often results in lower interconnect bandwidth than Ethernet offers. Nvidia recognized this early on and avoided PCIe semantics in its NVLink protocol: an NVLink 5.0 lane signals at 200 Gbps, roughly three times the 64 GT/s per-lane rate of PCIe Gen6.
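As a quick sanity check on that ratio, the snippet below compares the raw per-lane signaling rates; it assumes PCIe Gen6's 64 GT/s lane rate and ignores encoding and FEC overhead on both sides.

```python
# Raw per-lane signaling rates; encoding/FEC overhead ignored on both sides.
pcie_gen6_lane_gbps = 64.0    # PCIe Gen6: 64 GT/s per lane
nvlink5_lane_gbps = 200.0     # NVLink 5.0: 200 Gbps per lane (as stated above)

ratio = nvlink5_lane_gbps / pcie_gen6_lane_gbps
print(f"NVLink 5.0 vs. PCIe Gen6, per lane: {ratio:.1f}x")   # -> ~3.1x
```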
In distributed training/inference workloads, scale-up architectures primarily handle high-bandwidth tensor parallel traffic, which carries the results of partial matrix multiplications. To improve efficiency, computing frameworks can pipeline these multiplication operations or overlap other computations with the transfer of results.
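As a rough, framework-agnostic illustration of that overlap, the sketch below pipelines block-wise partial matrix multiplications with the transfer of their results; `send_partial` is a hypothetical placeholder for an asynchronous fabric transfer (for example, an async all-reduce), not a real API.

```python
# Minimal sketch: while block i's partial result is in flight, block i+1 is
# already being computed, so transfer latency is hidden behind compute.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def send_partial(block_result):
    # Hypothetical stand-in for an asynchronous transfer over the scale-up fabric.
    return block_result.nbytes  # pretend "bytes acknowledged"

def pipelined_matmul(a_blocks, b, executor):
    in_flight = None
    for a_blk in a_blocks:
        partial = a_blk @ b                  # compute this block's partial product
        if in_flight is not None:
            in_flight.result()               # wait for the transfer that ran while we computed
        in_flight = executor.submit(send_partial, partial)   # ship it while the next block computes
    if in_flight is not None:
        in_flight.result()                   # drain the final transfer

a = np.random.rand(1024, 512)
b = np.random.rand(512, 256)
with ThreadPoolExecutor(max_workers=1) as ex:
    pipelined_matmul(np.array_split(a, 8), b, ex)
```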
In high-performance computing (HPC) workloads, GPU memory is aggregated to form a large, unified memory pool. However, this approach faces challenges such as GPU thread stalls caused by cache misses when data resides on another GPU. Compilers can mitigate these stalls with well-known distributed-computing techniques, such as prefetching and double buffering, that overlap computation with communication, which reduces the need for ultra-low latency.
The UAL team recognized that bandwidth should be prioritized over latency. For scale-up, they chose to run PCIe Gen6-based technology at 128 Gbps. However, as data rates increase, so does the probability of transmission errors, which calls for stronger Forward Error Correction (FEC).
There are rumors that the UAL team is now moving from a PCIe-based approach to an Ethernet-like physical layer to compete with Nvidia’s 200 Gbps channel speeds. Ethernet-style SerDes, using PAM4 signaling (and potentially PAM6 for 400 Gbps and beyond), achieves longer reach and higher bandwidth, but it requires powerful FEC to cope with the higher error rates, which adds latency. In return, the higher bandwidth allows more accelerators to be attached to a single switch, and the interconnect can span racks over copper cables, as Nvidia demonstrated at GTC 2024 with a 72-GPU system whose scale-up fabric was cabled between servers in copper.
This raises the question: why not adopt a unified scale-up and scale-out mechanism? Can the mechanisms developed by UEC be applied to a scale-up network?
UEC’s new transport protocol runs on top of Ethernet/IP, but IP routing is not needed for scale-up switching. The resulting header overhead (a total of 66 bytes, including the transport-protocol header) is excessive for scale-up traffic, which consists mainly of memory read/write and atomic operations. Although UEC is considering a compressed header of about 50 bytes for HPC workloads, the overhead remains significant. In addition, using a UEC-compliant Ethernet switch as a scale-up switch is inefficient in area and power, since these switches sit alongside high-power GPUs inside servers and operate under strict power constraints.
An alternative would be to pair an Ethernet-like SerDes at the physical layer (with stronger FEC to handle higher pre-FEC error rates, better equalization, and larger deskew buffers) with a custom transport protocol optimized for memory operations, similar to Nvidia’s NVLink protocol. NVLink defines read/write and atomic operations with flits ranging from 64 to 256 bytes (with the next generation potentially reaching up to 1000 bytes), using a 16-byte header for command, Cyclic Redundancy Check (CRC), and control fields. That works out to roughly 94% efficiency for a 256-byte transfer, versus only about 80% for an Ethernet link. CXL has similar semantics for memory operations, and any new protocol would likely adopt similar semantics for exchanging flits between GPU memories.
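These efficiency figures follow directly from header overhead versus payload size; the small calculation below uses the overhead numbers quoted in this section (a 16-byte flit header, roughly 50-byte and 66-byte UEC headers) and ignores FEC, preamble, and inter-packet gap.

```python
# Wire efficiency = payload / (payload + per-transfer header overhead).
def payload_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    return payload_bytes / (payload_bytes + overhead_bytes)

payload = 256  # a typical scale-up memory read/write transfer
for name, overhead in [("NVLink-style 16 B flit header", 16),
                       ("UEC compressed ~50 B header", 50),
                       ("UEC full ~66 B header", 66)]:
    print(f"{name}: {payload_efficiency(payload, overhead):.1%}")
# -> 94.1%, 83.7%, 79.5%
```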
Beyond its lead in scale-up architecture, Nvidia also benefits from a unified software framework and API across scale-up and scale-out, exemplified by SHARP, whose main goal is to offload and accelerate complex collective operations directly inside the network. This reduces the amount of data that must cross the network and thereby lowers overall communication time. SHARP is supported by both NVLink switches and scale-out Quantum InfiniBand switches, and Nvidia may soon add SHARP support to its Ethernet switches as well.
UEC is developing a specification called In-Network Collectives (INC) as an alternative, and UAL may need to define something similar. Placing both efforts under one umbrella would allow unified software APIs for INC and related features to be developed once, leveraging similar components in both scale-up and scale-out networks. Some UEC hardware features, such as link-level retry, credit-based transmission, and encryption/decryption standards, could also be reused in scale-up fabrics for HPC configurations.
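To make those link-level features concrete, here is a minimal Python sketch of credit-based transmission combined with link-level retry; the `Link`, `CreditedSender`, and `pump` names are purely illustrative and are not taken from any UEC or UAL document.

```python
# Toy model: a flit is sent only when the receiver has advertised a free
# buffer slot (a credit), and a flit that gets corrupted on the wire is
# retried from the head of the queue rather than reported to software.
import random
from collections import deque

class Link:
    """Lossy point-to-point 'wire'; transmit() returns True on ACK, False on NAK."""
    def __init__(self, error_rate: float = 0.05):
        self.error_rate = error_rate

    def transmit(self, flit) -> bool:
        return random.random() > self.error_rate

class CreditedSender:
    def __init__(self, link: Link, initial_credits: int):
        self.link = link
        self.credits = initial_credits   # free buffer slots at the receiver
        self.queue = deque()             # flits waiting for a credit or a retry

    def send(self, flit):
        self.queue.append(flit)
        self.pump()

    def return_credits(self, n: int):
        # Called when the receiver drains its buffer and frees slots.
        self.credits += n
        self.pump()

    def pump(self):
        # Transmit while credits last; on a NAK keep the flit for link-level retry.
        while self.queue and self.credits > 0:
            if self.link.transmit(self.queue[0]):
                self.queue.popleft()     # ACK: the flit now occupies a receiver slot
                self.credits -= 1
            else:
                break                    # NAK: retry the same flit on the next pump

sender = CreditedSender(Link(), initial_credits=4)
for i in range(8):
    sender.send(f"flit-{i}")
sender.return_credits(4)                 # receiver drained, credits flow back
sender.pump()                            # retry anything left after a NAK
print("still queued:", list(sender.queue))
```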
Broadcom previously withdrew from UAL, reportedly because it disagreed with the initial choice of a PCIe-style interconnect. Now that UAL’s direction may have changed, will Broadcom rejoin? Or will it develop scale-up specifications on its own or through UEC? If the latter, the scale-up domain will remain fragmented, delaying widespread adoption.
Without industry consensus on scale-up and scale-out specifications, it will be very difficult to compete with Nvidia. It would be beneficial for both alliances to either merge into one or collaborate to accelerate the release of open standards for scale-up and scale-out systems.