Earlier this year, machine learning algorithm developer Chaim Rand evaluated the Amazon EC2 DL1 instance based on the Habana Gaudi processor and demonstrated some of its unique properties. This was a sequel to a previous post in which he discussed using dedicated AI accelerators and the potential challenges of adopting them. In his first Medium post, he recommended breaking down the task of migrating your training application to a new AI accelerator into four steps:
- High-level compatibility analysis: get an early assessment of whether the properties of your workload align with the chip specifications and the supporting software stack.
- Adjusting your model to run on the new accelerator: you may need to make some adjustments to your model such as replacing operations that are not supported by the dedicated AI accelerator.
- Optimizing the runtime performance on the new accelerator: in order to take full advantage of the solution, you will want to analyze and maximize its utilization.
- Tuning the model to converge on the new accelerator: some modifications to the model hyperparameters may be required in order to ensure timely convergence.
For the sequel, he followed these steps in his evaluation of the DL1 instance and found that the “Habana Gaudi offering, which powers the DL1 instance, demonstrated all the makings of a worthy alternative to other AI accelerators on the market today. In particular, its accompanying software stack provides the user with a great deal of flexibility in designing and optimizing machine learning workloads. Being first of its kind technology and a novice in the industry, it’s important to approach new training on Gaudi with the right mindset. Reaching optimal results may require patience and resilience,” but Chaim notes “it is well worth the potential reward.” On Chaim’s models (built with TensorFlow Keras), the price-performance improvement met and even exceeded the published 40% mark. The blog post was based on the most current software stack available at the time of his writing, version 1.2.0.
The SynapseAI SDK includes support for both TensorFlow and PyTorch, two of the most popular machine learning frameworks in use today. It also includes support for Horovod, a popular framework for distributed training. These offerings make model creation for Gaudi extremely accessible to the modern-day machine learning developer.
There are multiple ways to start up an Amazon EC2 instance and to set up a DL1 runtime environment. The best option depends on your or your organization’s overall cloud architecture. Adapting your model to run on Habana Gaudi requires just two lines of code.
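For TensorFlow, those two lines load the Habana software module so that supported operations are placed on the Gaudi device. The sketch below is illustrative and assumes a DL1 instance with the SynapseAI software stack installed; it will not run in a standard environment, and the surrounding model code is a placeholder:

```python
import tensorflow as tf

# The two lines that adapt a TensorFlow workload to Gaudi:
# loading the Habana module registers the HPU device with TensorFlow.
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()

# From here on, a standard Keras workflow runs on the Gaudi accelerator
# wherever its operations are supported (placeholder model for illustration).
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer="adam", loss="mse")
```

After the module is loaded, training and evaluation proceed as usual; unsupported operations fall back to the CPU, which is why Chaim's first migration step, the compatibility analysis, matters for performance.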
Check out Chaim’s full article and learn more about his assessment of training workloads on DL1 instances. Gaudi means fun in German and we hope that you have as much “Gaudi” with your Gaudi as he and his team experienced.