Nvidia introduces innovative AI Ethernet


Introduction

At the Computex conference in Taipei, Taiwan, Nvidia CEO Jensen Huang revealed an array of new offerings, including a specialized Ethernet switch designed to handle the substantial data volumes that AI workloads require.

"How can we introduce a new Ethernet that is backward compatible with existing systems and turn every data centre into an AI powerhouse?" Huang asked during his keynote speech. "We are now bridging the gap between high-performance computing and the Ethernet market," he added.

High-performance Ethernet for AI

Nvidia claims Spectrum-X, a new Ethernet product line, is the world's first high-performance Ethernet for AI. According to Gilad Shainer, Nvidia's senior vice president of networking, a standout feature is that it avoids dropping packets. The initial product, the Spectrum-4 switch, is billed as the world's first 51.2 Tb/s Ethernet switch designed specifically for AI networks.

The switch, combined with Nvidia's BlueField data processing unit (DPU) chips and fibre-optic transceivers, routes either 128 ports of 400-gigabit Ethernet or 64 ports of 800-gigabit Ethernet.
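The two port configurations imply the same aggregate bandwidth, which is where the switch's headline capacity figure comes from. A quick arithmetic check (Python, purely illustrative):

```python
# Sanity-check of the Spectrum-4 port configurations quoted above.
# Both layouts add up to the switch's 51.2 Tb/s aggregate capacity.
SWITCH_CAPACITY_TBPS = 51.2

configs = {
    "128 x 400GbE": 128 * 400,  # gigabits per second
    "64 x 800GbE": 64 * 800,
}

for name, gbps in configs.items():
    tbps = gbps / 1000
    print(f"{name}: {tbps} Tb/s")   # both print 51.2 Tb/s
    assert tbps == SWITCH_CAPACITY_TBPS
```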

During the presentation, Huang held up the sizable Spectrum-4 switch chip, emphasizing its immense proportions: a hundred billion transistors on a 90-by-90-millimetre die made with Taiwan Semiconductor Manufacturing's "4N" process technology. The chip operates at 500 watts.

Disruptive Potential

Nvidia's chip-and-switch combination could shake up the Ethernet networking industry. The dominant supplier of switch silicon is Broadcom, which sells to networking equipment makers such as Cisco Systems, Arista Networks, Extreme Networks, and Juniper Networks; those companies have been enhancing their own gear to handle AI traffic.

Spectrum-X addresses a split in the data-centre market. On one side are "AI factories": costly installations of powerful GPUs linked by Nvidia's NVLink and InfiniBand technologies, used primarily to train AI models on a small number of very intensive workloads. On the other side are AI clouds: multi-tenant facilities built on Ethernet that handle many workloads simultaneously.

Spectrum-X is designed for the AI cloud side, where the job is delivering predictions (inference) to AI customers. According to Shainer, it manages network traffic with a new congestion-control mechanism that keeps packets from piling up in the memory buffers of network switches.

Proactive Latency Management

Using advanced telemetry, Spectrum-X measures latencies across the network to identify potential congestion hotspots before they form, Nvidia says, keeping traffic flowing without packet build-up.
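Nvidia has not published the algorithm, but latency-triggered congestion control can be sketched in miniature: a sender compares measured latency against a baseline, backs off multiplicatively when latency spikes (a sign that buffers are filling), and recovers additively once latency settles. The function and thresholds below are hypothetical illustrations, not Nvidia's implementation:

```python
def adjust_rate(rate_gbps, latency_us, baseline_us,
                threshold=1.5, backoff=0.7, recover_gbps=5.0,
                max_rate_gbps=400.0):
    """Toy latency-based congestion control (illustrative only).

    If measured latency exceeds the baseline by `threshold`x, the
    sender multiplicatively backs off; otherwise it additively
    recovers toward line rate -- the classic AIMD pattern.
    """
    if latency_us > threshold * baseline_us:
        return rate_gbps * backoff                        # congestion: back off
    return min(rate_gbps + recover_gbps, max_rate_gbps)   # probe upward

# Example: a latency spike triggers back-off, then the rate recovers.
rate = 400.0
for latency in [10, 12, 40, 35, 12, 11]:  # microseconds
    rate = adjust_rate(rate, latency, baseline_us=10)
    print(round(rate, 1))  # 400.0, 400.0, 280.0, 196.0, 201.0, 206.0
```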

According to Nvidia, the world's leading hyperscalers, including top-tier cloud providers, are adopting Spectrum-X.

Nvidia is currently building a test-bed computer, dubbed Israel-1, in its Israel offices. The "generative AI supercomputer" uses Dell PowerEdge XE9680 servers fitted with H100 GPUs, moving data through Spectrum-4 switches.

Introducing the DGX GH200

Alongside the Ethernet announcements, Huang's keynote introduced the DGX GH200, the latest addition to Nvidia's DGX line of AI computers. Nvidia positions the DGX GH200 as a large-memory system built specifically to handle massive generative AI models.

Generative AI refers to programs that go beyond merely scoring or classifying inputs to produce new output, from text to images and other artefacts. OpenAI's ChatGPT bot is a notable example.

At the heart of the DGX GH200 is Nvidia's Grace Hopper "superchip": a single circuit board that combines the Hopper GPU with the Grace CPU, an Arm-based processor designed to rival x86 CPUs from Intel and Advanced Micro Devices (AMD). The GH200 is the first system to ship with this superchip.

Hopper Supercomputer in Full Production

Nvidia's GH200, the first iteration of Grace Hopper, is now in full production. The DGX GH200 supercomputer, built from 256 of the superchips, delivers 1 exaflops of computing power and 144 terabytes of shared memory. Hyperscalers and supercomputing centres worldwide are among the first customers with access to GH200-powered systems. Nvidia says the DGX GH200 offers nearly 500 times the memory of its predecessor, the DGX A100.
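The 144-terabyte figure can be cross-checked against the per-superchip memory sizes Nvidia published for the GH200: 480 GB of LPDDR5X attached to the Grace CPU plus 96 GB of HBM3 on the Hopper GPU. Those per-chip numbers come from Nvidia's GH200 announcement, not this article; a quick sanity check:

```python
# Cross-check of the DGX GH200's 144 TB shared-memory figure.
# Per-superchip sizes are from Nvidia's GH200 announcement (assumption,
# not stated in this article): 480 GB LPDDR5X + 96 GB HBM3.
SUPERCHIPS = 256
CPU_MEM_GB = 480   # Grace LPDDR5X
GPU_MEM_GB = 96    # Hopper HBM3

total_gb = SUPERCHIPS * (CPU_MEM_GB + GPU_MEM_GB)
total_tb = total_gb / 1024  # using binary terabytes, as Nvidia's figure does
print(f"{total_gb} GB = {total_tb} TB")  # → 147456 GB = 144.0 TB
```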

Nvidia also introduced MGX, a modular reference architecture that lets server makers efficiently build customized AI servers. ASRock Rack, ASUS, GIGABYTE, Pegatron, QCT, and Supermicro are the initial partners; QCT and Supermicro are set to launch MGX systems in August.