GAUDI: THE ONLY AI PROCESSOR TO DELIVER THE ADVANTAGES OF INTEGRATED, ON-CHIP RoCEv2
Efficient scale is the foundation of Gaudi’s architecture
Gaudi integrates ten 100 GbE ports of RoCE (RDMA over Converged Ethernet) into the processor chip, delivering unmatched advantages to customers who need to efficiently scale training systems in the AI data center.
Based on industry-standard connectivity
Using industry-standard Ethernet to scale AI reduces complexity: it is proven, efficient connectivity that is well established in the data center, and it provides the configuration flexibility to scale both data-parallel and model-parallel systems.
Eliminating proprietary lock-in
Industry-standard Ethernet connectivity lets data center decision makers avoid lock-in to proprietary solutions. It gives them the freedom to select AI processors as their needs change and to take advantage of the Ethernet roadmap and its diverse supplier ecosystem, inherently lowering total cost of ownership.
On-chip integration eliminates throughput bottlenecks
With RoCE natively integrated on the Gaudi training processor, customers avoid the performance and throughput bottlenecks inherent in off-chip RoCE implementations in GPU-based systems, which require connecting a separate NIC to each processor.
Building High-performance AI Training Systems with Gaudi®
Habana's Eitan Medina introduced the performance and scaling benefits of Gaudi when the processor was introduced in 2019.
View the June 20, 2019 GAUDI introduction
Scaling Systems with the HLS-1H Server
HLS-1H: AI SYSTEM WITH UNPRECEDENTED SCALE-OUT BANDWIDTH
New HLS-1H system
The new HLS-1H system contains four Gaudi HL-205 cards and exposes all 40 x 100GbE RoCE ports externally through ten QSFP-DD connectors (rather than dedicating some ports to internal all-to-all connections, as the HLS-1 does). Customers can connect an external host CPU using one or two PCIe connectors. As AI models grow exponentially, especially in language models and recommendation systems, the HLS-1H provides breakthrough network throughput for scale-out, allowing model-parallel training across accelerators at any scale.

- AI Processing: 4x HL-205
- Host Interface: Dual x16 PCIe Gen 4.0
- Memory: 128 GB HBM2
- Memory Bandwidth: 4 TB/s
- Scale-out Interface: RDMA (RoCEv2), 40x 100 GbE, 10x QSFP-DD
- System Dimensions: 3U height, 19″ and 21″
- Scale-out: 128x Gaudi (32x HLS-1H pod)
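The HLS-1H port budget quoted above is easy to verify with back-of-the-envelope arithmetic. The sketch below is a minimal check (plain Python, figures taken from the spec list; the 4 x 100 GbE lanes per QSFP-DD connector follow from 40 ports over ten connectors):

```python
# Back-of-the-envelope check of HLS-1H scale-out connectivity.
# Figures come from the spec list above; all values are per direction.
GAUDI_PORTS = 10        # 100 GbE RoCE ports integrated on each Gaudi
CARDS_PER_HLS1H = 4     # HL-205 cards per HLS-1H system
PORT_GBPS = 100         # line rate per RoCE port

# In HLS-1H no ports are spent on internal links, so all face the network.
external_ports = GAUDI_PORTS * CARDS_PER_HLS1H

# 40 ports through ten QSFP-DD connectors -> 4 x 100 GbE lanes each.
qsfp_dd_connectors = external_ports // 4

aggregate_gbps = external_ports * PORT_GBPS

print(external_ports, qsfp_dd_connectors, aggregate_gbps)  # 40 10 4000
```

The 4 Tb/s of aggregate scale-out bandwidth is what allows the all-external-port design to feed standard Ethernet switching without oversubscription inside the chassis.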

Massive RoCE
The massive RoCE connectivity makes the HLS-1H especially versatile as data centers deploy hardware for training huge models, which demand far more network bandwidth than smaller models trained with data-parallel regimes. Customers can choose the ratio of Ethernet switching capacity to AI compute, flexibly balancing compute, memory, and networking in their AI infrastructure. The HLS-1H enables scaling from a single rack to thousands of accelerators, with non-blocking 10 x 100GbE RoCE to and from any accelerator.
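The pod arithmetic behind the 128-Gaudi figure quoted in the spec list can be sketched as follows (plain Python; the only inputs are the card count, port count, and line rate stated above):

```python
# Pod-composition sketch for the 128-Gaudi HLS-1H pod quoted above.
SYSTEMS_PER_POD = 32
CARDS_PER_SYSTEM = 4     # HL-205 cards per HLS-1H
PORTS_PER_CARD = 10      # all ten RoCE ports face the network in HLS-1H
PORT_GBPS = 100

total_gaudi = SYSTEMS_PER_POD * CARDS_PER_SYSTEM

# Non-blocking bandwidth available to and from any single accelerator.
per_accelerator_gbps = PORTS_PER_CARD * PORT_GBPS

# Aggregate scale-out bandwidth across the whole pod.
pod_gbps = total_gaudi * per_accelerator_gbps

print(total_gaudi, per_accelerator_gbps, pod_gbps)  # 128 1000 128000
```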
Scaling Systems with the HLS-1 Server
HLS-1: 8 GAUDI AI SYSTEM
HLS-1 system
The HLS-1 system contains eight Gaudi HL-205 mezzanine cards and can be paired with a host CPU to form a server node. Each Gaudi processor dedicates seven of its ten 100 GbE RoCE ports to all-to-all connectivity within the system, leaving three ports per processor available for scale-out, for a total of 24 x 100GbE RoCE ports per system. This allows end customers to scale their deployments using standard 100GbE switches.
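The port budget described above can be sketched in a few lines of plain Python (figures from this paragraph only):

```python
# Sketch of the HLS-1 RoCE port budget: each of the eight Gaudi processors
# spends one port per peer on the internal all-to-all, and the remainder
# goes to scale-out.
NUM_GAUDI = 8
PORTS_PER_GAUDI = 10

internal_per_gaudi = NUM_GAUDI - 1                        # one link to each of 7 peers
scale_out_per_gaudi = PORTS_PER_GAUDI - internal_per_gaudi
total_scale_out = scale_out_per_gaudi * NUM_GAUDI

# Distinct all-to-all links inside the chassis: 8 choose 2.
internal_links = NUM_GAUDI * (NUM_GAUDI - 1) // 2

print(scale_out_per_gaudi, total_scale_out, internal_links)  # 3 24 28
```

The 28 internal links consume 56 of the 80 on-chip ports, and the remaining 24 ports are exactly the system's external scale-out interface.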

- AI Processing: 8x HL-205
- Host Interface: Four x16 PCIe Gen 4.0
- Memory: 256 GB HBM2
- Memory Bandwidth: 8 TB/s
- Scale-out Interface: RDMA (RoCEv2), 24x 100 GbE, 5x QSFP-DD
- System Dimensions: 3U height, 19″
- Scale-out: 128x Gaudi (16x HLS-1 pod)

The high RoCE bandwidth inside and outside the box, together with the unified protocol used for scale-out, makes the solution easily scalable and cost-effective. Customers can connect to a host CPU using up to four PCIe connectors. Housing the AI accelerators in a separate box lets customers choose whatever ratio of CPU to AI acceleration their specific applications require.