Building High-performance AI Training Systems with Gaudi®

Habana's Eitan Medina gives a deep dive on how AI datacenter and cloud systems can scale up and scale out with Gaudi's integrated on-chip RoCE RDMA.

View the June 20, 2019 Deep-Dive

Startup Expects to Top Nvidia V100 Performance at Half the Power

Linley Group comments on early assessments of Gaudi’s training performance and power in its June 17 Microprocessor Report.

See the report

Gaudi® Performance

A single Gaudi card, dissipating 140 watts,
delivers 1,650 images/second of training throughput.

Gaudi’s training performance also scales, from small-scale servers to large-scale
deployments, with record-breaking performance.

Gaudi® vs. V100

ResNet-50 Training Throughput at Scale

Based on NVIDIA-reported MLPerf v0.5 performance metrics

Designed to Scale

Gaudi is designed for versatile and efficient system scale out and scale up
with integrated on-chip RoCE RDMA, enabling high-performance interconnectivity.
Habana’s use of standards-based connectivity gives
customers freedom from lock-in with proprietary vendor solutions.

Gaudi Chip

Check out our "Scaling AI Training" video

Gaudi® Hardware

Gaudi® HL-205 Mezzanine Card

Gaudi® HL-200 PCIe Card

Gaudi Card

Gaudi® HLS-1

HLS-1 Block Diagram

HLS-1 in a Rack

Based on the Gaudi HLS-1,
the AI Training Rack is modular and flexible to efficiently support
the growing demands on AI compute infrastructure.

Scaling Out to 1,000+ Gaudis

This topology shows how a larger system can be built using the Gaudi system as a basic component. It has three reduction levels: one within each system (8 Gaudis), one across 11 Gaudi systems, and one across 12 islands. Altogether, this system hosts 8 × 11 × 12 = 1,056 Gaudis. Larger systems can be built with an additional aggregation layer or with less bandwidth per Gaudi.
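
The device count and the three reduction levels can be sketched in a few lines of Python. This is only an illustration of the arithmetic and of level-by-level reduction, not Habana's software: the names `LEVELS`, `total_devices`, and `hierarchical_sum` are hypothetical, and a plain sum stands in for whatever reduction the real collective performs.

```python
from math import prod

# Fan-in at each reduction level, taken from the text:
# 8 Gaudis per system, 11 systems per island, 12 islands.
LEVELS = (8, 11, 12)


def total_devices(levels=LEVELS):
    """Number of devices in a hierarchy with the given fan-in per level."""
    return prod(levels)


def hierarchical_sum(values, levels=LEVELS):
    """Sum one value per device, reducing one hierarchy level at a time."""
    if len(values) != prod(levels):
        raise ValueError("expected exactly one value per device")
    for fan_in in levels:
        # Each consecutive group of `fan_in` partial results is reduced to one.
        values = [sum(values[i:i + fan_in]) for i in range(0, len(values), fan_in)]
    return values[0]


if __name__ == "__main__":
    n = total_devices()
    per_device = list(range(n))  # stand-in for per-device contributions
    print(n)
    print(hierarchical_sum(per_device) == sum(per_device))
```

Adding a fourth entry to `LEVELS` models the "additional aggregation layer" mentioned above; for example, `total_devices((8, 11, 12, 2))` gives 2,112 devices.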

Gaudi Scale

For more information on the versatility
and scalability of Habana Gaudi,
see the Gaudi whitepaper.

Gaudi Software

SynapseAI™ Software Development Platform and Tools simplify building with or migrating to Gaudi-based systems

Habana SynapseAI™ compiler and runtime

  • Seamlessly integrates with existing/popular frameworks
  • Can be accessed directly through a C or Python API
  • Supports both Imperative/Eager and Graph modes

SynapseAI features

  • Multi-stream execution environment
  • JIT compiler
  • Support for dynamic shapes
  • Habana Communication Library (HCL), tailored to Gaudi’s high-performance RDMA
  • Pre-written, performance-optimized TPC kernels

Gaudi Platform state-of-the-art development tools

  • Enable users to quickly and easily deploy a variety of network models and algorithms.
  • Feature visual real-time performance profiling, plus tools for advanced users: an LLVM-based C compiler, simulator, debugger, and profiling capabilities
  • Facilitate the development of customized TPC kernels to augment Habana-provided kernels.