GAUDI: THE ONLY AI PROCESSOR TO DELIVER THE ADVANTAGES OF INTEGRATED, ON-CHIP RoCEv2

 

Efficient scale is the foundation of Gaudi’s architecture
Gaudi integrates ten 100 GbE ports of RoCE (RDMA over Converged Ethernet) directly into the processor chip, delivering unmatched advantages to customers for efficiently scaling training systems in the AI data center.

Based on industry-standard connectivity
Scaling AI over industry-standard Ethernet reduces complexity: the connectivity is proven, efficient, and well established in the data center, and it can be configured flexibly to support both data-parallel and model-parallel systems.

Eliminating proprietary lock-in
Industry-standard Ethernet connectivity lets data center decision makers avoid lock-in to proprietary interconnects. They retain the freedom to select AI processors as their needs change, and they benefit from the Ethernet roadmap and its diverse supplier ecosystem, which inherently lowers total cost of ownership.

On-chip integration eliminates throughput bottlenecks
With RoCE integrated natively on the Gaudi training processor, customers avoid the performance and throughput bottlenecks inherent in GPU-based systems, where RoCE is implemented off-chip and each processor must connect through a separate NIC.

OPTIONAL SERVER CONFIGURATIONS: BUILT FOR SCALE AND FLEXIBILITY


Learn More

Building High-performance AI Training Systems with Gaudi®

Habana's Eitan Medina presented the performance and scaling benefits of Gaudi when the processor was introduced in 2019.


View the June 20, 2019 Gaudi introduction

SCALING SYSTEMS WITH HLS-1H SERVER

HLS-1H: AI SYSTEM WITH UNPRECEDENTED SCALE-OUT BANDWIDTH

New HLS-1H system

The new HLS-1H system contains four Gaudi HL-205 cards and exposes all 40 x 100 GbE RoCE ports externally through ten QSFP-DD connectors, rather than dedicating some ports to internal all-to-all connections as the HLS-1 does. Customers can connect an external CPU through one or two PCIe connectors. As AI models grow exponentially, especially language models and recommendation systems, the HLS-1H provides breakthrough network throughput for scaling out, enabling model-parallel training across accelerators at any scale.
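To put those numbers in perspective, here is a minimal back-of-the-envelope sketch in Python of the aggregate scale-out bandwidth the 40 exposed ports imply; the constants are taken from the spec table below, and the names are illustrative rather than part of any Habana API.

```python
# Back-of-the-envelope scale-out bandwidth for one HLS-1H system.
# Constants come from the spec table below; names are illustrative.

GAUDIS_PER_HLS_1H = 4      # HL-205 cards per system
PORTS_PER_GAUDI = 10       # integrated 100 GbE RoCE ports per Gaudi
PORT_SPEED_GBPS = 100      # per-port line rate, Gb/s

external_ports = GAUDIS_PER_HLS_1H * PORTS_PER_GAUDI   # all 40 exposed
scale_out_gbps = external_ports * PORT_SPEED_GBPS

print(f"External RoCE ports: {external_ports}")                  # 40
print(f"Aggregate scale-out: {scale_out_gbps / 1000:.1f} Tb/s")  # 4.0 Tb/s
```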

AI Processing: 4x HL-205
Host Interface: Dual x16 PCIe Gen 4.0
Memory: 128 GB HBM2
Memory Bandwidth: 4 TB/s
Scale-out Interface: RDMA (RoCEv2), 40 x 100 GbE, 10x QSFP-DD
System Dimensions: 3U height, 19″ and 21″

SCALE-OUT: 128x GAUDI, 32x HLS-1H POD

Massive RoCE

This massive RoCE connectivity makes the HLS-1H especially versatile as data centers deploy hardware for training huge models, which demand far more network bandwidth than smaller models trained with data-parallel regimes. Customers can set the ratio of Ethernet switching capacity to AI compute, flexibly balancing compute, memory, and networking in their AI infrastructure. The HLS-1H enables scaling from a single rack to thousands of accelerators with non-blocking 10 x 100 GbE RoCE to and from every accelerator.
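As a rough illustration of that flexibility, the sketch below (Python; the figures come from the two spec tables on this page, and the structure is illustrative only, not a Habana tool) compares per-accelerator scale-out bandwidth between the HLS-1H, which exposes every port, and the HLS-1, which reserves seven ports per Gaudi for its internal all-to-all.

```python
# Illustrative per-accelerator scale-out bandwidth comparison for the
# two reference systems on this page; not a Habana tool or API.

systems = {
    # name: (Gaudis per box, external 100 GbE ports per box)
    "HLS-1H": (4, 40),  # all ten ports per Gaudi exposed externally
    "HLS-1":  (8, 24),  # three ports per Gaudi exposed externally
}

for name, (gaudis, ports) in systems.items():
    per_accel_gbps = ports * 100 / gaudis
    print(f"{name}: {ports} x 100 GbE total, "
          f"{per_accel_gbps:.0f} Gb/s scale-out per Gaudi")
# HLS-1H: 1000 Gb/s per Gaudi (non-blocking, all ports out)
# HLS-1:  300 Gb/s per Gaudi
```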

Gaudi HLS-1H Datasheet

SCALING SYSTEMS WITH HLS-1 SERVER

HLS-1: 8 GAUDI AI SYSTEM

HLS-1 system

The HLS-1 system contains eight Gaudi HL-205 mezzanine cards and can be paired with a host CPU to form a server node. Each Gaudi processor dedicates seven of its ten 100 GbE RoCE ports to all-to-all connectivity within the system, leaving three ports per processor for scale-out, for a total of 24 x 100 GbE RoCE ports per system. This allows end customers to scale their deployment using standard 100 GbE switches.
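A minimal sketch of that port allocation, assuming one direct 100 GbE link between every pair of processors (the actual board wiring and port numbering may differ):

```python
# HLS-1 port budget: eight Gaudis, each with ten integrated 100 GbE
# RoCE ports; one port per peer forms the internal all-to-all, and the
# remainder are exposed for scale-out. Numbering is illustrative only.

NUM_GAUDIS = 8
PORTS_PER_GAUDI = 10

# One direct link per unordered pair of processors.
internal_links = [(a, b) for a in range(NUM_GAUDIS)
                  for b in range(a + 1, NUM_GAUDIS)]

internal_ports = NUM_GAUDIS - 1                       # 7 per Gaudi
external_ports = PORTS_PER_GAUDI - internal_ports     # 3 per Gaudi
total_external = NUM_GAUDIS * external_ports          # 24 per system

print(f"Internal all-to-all links: {len(internal_links)}")  # 28
print(f"External 100 GbE ports: {total_external}")          # 24
```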

AI Processing: 8x HL-205
Host Interface: Four x16 PCIe Gen 4.0
Memory: 256 GB HBM2
Memory Bandwidth: 8 TB/s
Scale-out Interface: RDMA (RoCEv2), 24 x 100 GbE, 5x QSFP-DD
System Dimensions: 3U height, 19″


SCALE-OUT: 128x GAUDI, 16x HLS-1 POD

The high RoCE throughput inside and outside the box, together with the unified protocol used for scale-out, makes the solution easily scalable and cost effective. Customers can connect to a CPU using up to four PCIe connectors. Separating the AI acceleration into its own box lets customers choose any ratio of CPU to AI acceleration, depending on their specific application requirements.
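As a rough sketch of what this looks like at pod scale (Python, with illustrative constants drawn from this page; it assumes every external port is cabled to the Ethernet fabric):

```python
# Pod-level port budget for the 16x HLS-1 configuration described above.
# Constants come from this page; the calculation is illustrative only.

HLS1_PER_POD = 16
GAUDIS_PER_HLS1 = 8
EXTERNAL_PORTS_PER_HLS1 = 24   # 3 x 100 GbE per Gaudi

gaudis = HLS1_PER_POD * GAUDIS_PER_HLS1                # 128 accelerators
fabric_ports = HLS1_PER_POD * EXTERNAL_PORTS_PER_HLS1  # 384 x 100 GbE

print(f"Pod size: {gaudis} Gaudi processors")
print(f"100 GbE fabric ports required: {fabric_ports}")
```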

See the Gaudi whitepaper for details