Habana Gaudi debuts in the Amazon EC2 cloud

The numbers behind “up to 40% better price performance”

Today, AWS announced the availability of Amazon EC2 DL1.24xlarge instances, accelerated by Habana Gaudi AI processors. This is the first AI training instance from AWS that is not based on GPUs.

The primary motivation for creating this new training instance class was presented by Andy Jassy at re:Invent 2020: “To provide our end-customers with up to 40% better price-performance than the current generation of GPU-based instances.”

Let’s look at Gaudi’s cost-efficiency for popular computer vision and natural language processing workloads.

As AWS has published the On-Demand hourly pricing for DL1 alongside the p4d, p3dn and p3 GPU-based instances, there’s a simple way for end-users to assess the price-performance themselves: take the latest TensorFlow Docker containers provided by Nvidia on NGC and by Habana in our software Vault, run them on the respective instances, and compare the training throughput against the hourly pricing.

Different models will provide different results, and not all models are supported yet on Gaudi. For this evaluation, we considered two popular models: ResNet-50 and BERT-Large. 

The table below shows the training throughput, hourly pricing, and the calculated throughput per dollar (million images per $) for training TensorFlow ResNet-50 on the various instance types. Setting the performance-per-dollar on p4d.24xlarge instance as the baseline, we calculate the relative values for each of the other instance types and the corresponding percent of cost savings that the DL1 can deliver to EC2 end-customers who currently use GPU-based instances for this workload.

[Table: TensorFlow ResNet-50 training throughput, hourly pricing and throughput per dollar on AWS EC2 instances]
(*) Measured by Habana on AWS EC2 GPU-based instances, on June 28th, using Nvidia Deep Learning AMI (Ubuntu 18.04) + Docker 21.06-tf1-py3 available at: https://ngc.nvidia.com/catalog/containers/nvidia.tensorflow
Model: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/ConvNets/resnet50v1.5
Your measured performance results may vary.
(**) Measured by Habana on AWS EC2 DL1.24xlarge instance, using DLAMI integrating SynapseAI 1.0.1-81 TensorFlow 2.5.1 Container at Habana’s Vault, model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras. Based on pricing published at: https://aws.amazon.com/ec2/pricing/on-demand
Your measured performance results may vary.

Based on Habana’s testing of the various EC2 instances and the pricing published by Amazon, we find that, relative to the p4d instance, the DL1 provides 44% cost savings in training ResNet-50. For p3dn end-users, the cost savings in training ResNet-50 are 69%.
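For readers who want to repeat this calculation with their own measurements, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not Habana’s actual tooling: the class and function names are made up for illustration, and the throughput and price inputs are placeholder values, not the measured results or AWS prices. It also assumes that "cost savings" means the reduction in cost to train on a fixed number of samples, which is inversely proportional to throughput per dollar.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600


@dataclass(frozen=True)
class InstanceResult:
    """One row of a price-performance comparison (hypothetical helper)."""
    name: str
    throughput_per_sec: float  # measured training throughput, e.g. images/s
    price_per_hour: float      # On-Demand price in USD per hour

    @property
    def samples_per_dollar(self) -> float:
        # Samples processed in one hour divided by the cost of that hour.
        return self.throughput_per_sec * SECONDS_PER_HOUR / self.price_per_hour


def relative_value(candidate: InstanceResult, baseline: InstanceResult) -> float:
    """Throughput per dollar of the candidate, with the baseline set to 1.0."""
    return candidate.samples_per_dollar / baseline.samples_per_dollar


def cost_savings(candidate: InstanceResult, baseline: InstanceResult) -> float:
    """Fractional cost reduction for a fixed amount of training work.

    Assumption: cost for a fixed number of samples is inversely
    proportional to samples per dollar.
    """
    return 1.0 - baseline.samples_per_dollar / candidate.samples_per_dollar


# Placeholder inputs for illustration only; replace them with your own
# measured throughput and the current published hourly prices.
baseline = InstanceResult("gpu-baseline", throughput_per_sec=2000.0, price_per_hour=30.0)
candidate = InstanceResult("candidate", throughput_per_sec=1500.0, price_per_hour=15.0)

print(f"relative value: {relative_value(candidate, baseline):.2f}")
print(f"cost savings:   {cost_savings(candidate, baseline):.1%}")
```

With these placeholder inputs, the candidate has lower absolute throughput than the baseline but still comes out ahead on throughput per dollar, which is the same pattern as the comparison described above.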

Habana recognizes the importance of MLPerf performance benchmarking, and users can find our 1.0 submission results, published in June, for an eight-Gaudi system that is very similar to the DL1.24xlarge. With this MLPerf submission, Habana did not apply additional software optimizations, such as data packing or layer fusion, to boost our performance. Rather, we aimed to submit results that are closest to the reference code and are representative of the out-of-the-box performance that customers can get using our SynapseAI® TensorFlow software today. Consequently, it is easy for customers to make small adjustments to the model (change data, switch layers) while maintaining similar performance. The MLPerf time-to-train (TTT) results delivered on TensorFlow are similar to the training throughput our early customers see today.

While the absolute throughput per instance is lower, the Gaudi-based EC2 DL1 pricing is only a fraction of the p4d pricing. How is this possible? While the 16nm, HBM2-based Gaudi does not pack as many transistors as the 7nm, HBM2e-based A100 GPU, Gaudi’s architecture, designed from the ground up for efficiency, achieves higher utilization of resources and comprises fewer system components than the GPU architecture. As a result, lower system costs ultimately enable lower pricing to end users.

With language models, the price-performance improvement of Gaudi vs. GPUs is lower than in vision models, with cost savings of 10% vs. p4d and 54% vs. p3dn. BERT-Large is a popular model used today, and we used the throughput in Phase-1 as a proxy for performance that users can measure by themselves. Following are the results using recent out-of-the-box containers and model hyperparameters published in Nvidia’s NGC and in Habana’s Vault and GitHub, for TensorFlow, on the actual EC2 instances.

[Table: TensorFlow BERT-Large Phase-1 training throughput and pricing on AWS EC2 instances]
(*) Measured by Habana on AWS EC2 GPU-based instances, on June 28th, using Nvidia Deep Learning AMI (Ubuntu 18.04) + Docker 21.06-tf1-py3 available at: https://ngc.nvidia.com/catalog/containers/nvidia.tensorflow/tags
Model: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT
Your measured performance results may vary.
(**) Measured by Habana on AWS EC2 DL1.24xlarge instance, using DLAMI integrating SynapseAI 1.0.1-81 TensorFlow 2.5.1 Container at Habana’s Vault, model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/nlp/bert
Pricing published at https://aws.amazon.com/ec2/pricing/on-demand
Your measured performance results may vary.
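To put the cost-savings percentages quoted in this post in price-performance terms, they can be converted back into an implied throughput-per-dollar ratio. The sketch below uses only the four percentages stated in the text. The helper name is hypothetical, and the conversion assumes that savings are computed as one minus the ratio of baseline to DL1 throughput per dollar, an assumption that the source does not spell out.

```python
def implied_perf_per_dollar_ratio(savings: float) -> float:
    """Convert a fractional cost saving into a throughput-per-dollar ratio.

    Assumption: savings = 1 - baseline_perf_per_dollar / dl1_perf_per_dollar,
    so the ratio dl1 / baseline equals 1 / (1 - savings).
    """
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in the range [0, 1)")
    return 1.0 / (1.0 - savings)


# Cost savings of DL1 as stated in the text, keyed by (model, GPU instance).
published_savings = {
    ("ResNet-50", "p4d"): 0.44,
    ("ResNet-50", "p3dn"): 0.69,
    ("BERT-Large", "p4d"): 0.10,
    ("BERT-Large", "p3dn"): 0.54,
}

for (model, instance), savings in published_savings.items():
    ratio = implied_perf_per_dollar_ratio(savings)
    print(f"{model} vs. {instance}: {savings:.0%} savings -> {ratio:.2f}x per dollar")
# ResNet-50 vs. p4d: 44% savings -> 1.79x per dollar
# ResNet-50 vs. p3dn: 69% savings -> 3.23x per dollar
# BERT-Large vs. p4d: 10% savings -> 1.11x per dollar
# BERT-Large vs. p3dn: 54% savings -> 2.17x per dollar
```

Under that assumption, a 10% saving corresponds to only about 1.11x the throughput per dollar, which illustrates why the BERT-Large advantage over p4d is described as smaller than the one for vision models.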

Habana submitted MLPerf BERT results representative of out-of-the-box performance that customers can get with the SynapseAI® TensorFlow software today. Consequently, it is easy for customers to make small adjustments to the model while maintaining similar performance.

For their MLPerf BERT submission, NVIDIA utilized a series of optimizations that are not available in their released software or easily consumable for general use. For example, they fused the entire multi-head attention block into a single kernel. If a customer wishes to use a different attention for long sequences, they would have to change the kernel or they would incur a performance drop. NVIDIA also used custom data loading techniques that are not available in their standard software distribution.

Comparing BERT performance with a recent TensorFlow AMI available from NGC (based on the 21.06-tf1-py3 NGC Docker container) on A100s vs. DL1 shows cost savings even for BERT. Habana plans a submission to MLPerf next month, implementing software optimizations for BERT that will show significant performance improvement relative to the May submission.

The value proposition with Gaudi is anchored in price-performance and usability. The architectural choices Habana made were aimed at reaching higher efficiency, but not at the expense of making the migration to Gaudi difficult for end-users.

If you would like to see what developers who were given early access have to say about Gaudi and DL1, please see Habana’s product page featuring the Amazon EC2 DL1 instances, which contains references from Seagate, Riskfuel, Leidos and others.

“We expect the significant price/performance advantage of Amazon EC2 DL1 instances, powered by Habana Gaudi accelerators, could make a compelling future addition to AWS compute clusters,” said Seagate’s Senior Engineering Director of Operations and Technology, Advanced Analytics, Darrell Louder. “As Habana Labs continues to evolve and enables broader coverage of operators, there is potential for expanding to additional enterprise use cases, and thereby harnessing additional cost savings.”

“AI and deep learning are at the core of our Machine Vision capability, enabling customers to make better decisions across industries we serve. In order to improve accuracy, data sets are becoming larger and more complex, requiring larger and more complex models. This is driving the need for improved compute price-performance,” said Srikanth Velamakanni, Group CEO of Fractal. “The new Amazon EC2 DL1 instances promise significantly lower cost training than GPU-based EC2 instances. We expect this to make training of AI models on cloud much more cost competitive and accessible than before for a broad array of clients.”

“One of the numerous technologies we are enabling to advance healthcare today is the use of machine learning and deep learning for disease diagnosis based on medical imaging data. Our massive data sets require timely and efficient training to aid researchers seeking to solve some of the most urgent medical mysteries. Given Leidos and its customers need for quick, easy, and cost-effective training for deep learning models, we are excited to have begun this journey with Intel and AWS to use Amazon EC2 DL1 instances based on Habana Gaudi AI processors. Using DL1 instances, we expect an increase in model training speed and efficiency, with a subsequent reduction in risk and cost of research and development,” said Chetan Paul, CTO Health and Human Services at Leidos.

“Two factors drew us to Amazon EC2 DL1 instances based on Habana Gaudi AI accelerators. First, we want to make sure our banking and insurance clients can run Riskfuel models that take advantage of the newest hardware. Fortunately, we found migrating our models to DL1 instances to be simple and straightforward – really, it was just a matter of changing a few lines of code. Second, training costs are a big component of our spending, and the promise of up to 40% improvement in price/performance offers potentially substantial benefit to our bottom line,” said Ryan Ferguson, CEO of Riskfuel.

Today, our reference models repository features twenty high-demand models and we have a roadmap to expand it, along with software features. In fact, you can review the roadmap, which is open to all on Habana’s GitHub.

The developer journey begins with the SynapseAI® SDK. We won’t go into the details of the SDK here; if you want to learn more, please take a look at our Documentation page. The SynapseAI® Software Portfolio is designed to facilitate high-performance deep learning training on Habana Gaudi accelerators. It includes the Habana graph compiler and runtime, TPC kernel library, firmware and drivers, and developer tools such as the Habana profiler and TPC SDK for custom kernel development.

SynapseAI is integrated with the TensorFlow and PyTorch frameworks. The TensorFlow integration is more mature than the Gaudi PyTorch integration, as development on the latter began two quarters later.

As a result, Habana PyTorch models will provide lower performance (throughput and time-to-train) compared to our similar TensorFlow models. We have documented the known limitations in our SynapseAI user guides, as well as in the reference models on GitHub. In addition, we have published the performance results for reference models on the Habana Developer site. The Habana team is committed to improving both usability and performance in subsequent releases.

We know there is much more to be done in further developing our software and model coverage, and we are counting on data scientists and developers to explore Gaudi and provide us with their feedback and requests. We are looking forward to engaging with the DL community using Gaudi in the cloud (via the Amazon EC2 DL1 instances) and on-premises, through our Developer Site and our GitHub.

What is next? We have lots of work to do on our software and, in parallel, Habana is working on our next-generation Gaudi®2 AI processor, which takes the Gaudi architecture from 16nm to 7nm. This further improves price-performance for the benefit of our end-customers, while maintaining the same architecture and fully leveraging the same SynapseAI software and ecosystem we are building with Gaudi.

But today, the Habana team has the satisfaction of bringing the most cost-effective AI training to the AWS cloud with the Gaudi-based Amazon EC2 DL1 instances. Our focus on and commitment to AI are greater than ever and, in fact, you could say we’re All In on AI.