Habana Reports MLPerf Inference Results for the Goya Processor in Available Category

Today the MLPerf organization announced its first performance results for inference processors and systems. Habana’s Goya has been in production since December 2018 and is reported in the Available category.

To help readers sort out the different categories and performance metrics, MLPerf has provided an excellent paper, which can be accessed here. MLPerf splits results into two divisions, Closed and Open. The Closed division allows direct comparisons because submissions adhere to an explicit set of rules; the Open division lets vendors showcase their solutions more favorably, without the restrictive rules of the Closed division. Customers evaluating these results are also given categories that help them discern which solutions are mature and commercially available (the “Available” category) and which fall into the “Preview” or “Research” categories, covering hardware and software that is either not yet publicly available or experimental and intended for research. Additional data is provided on the specific hardware (processors and systems) used to run the benchmarks, including the number of accelerators per solution evaluated and the complexity of the host system used to measure the inference results (CPU generation, air cooling vs. water cooling, and more).

We believe this is a great industry initiative that will truly help customers determine which solutions are actually available for them to deploy (as opposed to just “preview”). While the report is incomplete in scope (it does not yet measure power, for instance), it is a critical first step toward the industry providing standardized, valid measures. As another example of what is still missing, MLPerf includes well-established vision benchmarks such as ResNet-50 and SSD-large, but it has not yet included BERT, which has achieved state-of-the-art results on many language tasks and is therefore very popular among cloud service providers. You may refer to our BERT results here.

A single Goya HL-102 card, using passive air cooling on an affordable host (an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz), delivers the following results in the Available category.

While MLPerf does not address the cost of solutions submitted, it does provide substantial foundational performance data to enable customers to determine which solutions should be evaluated for their specific applications and requirements.

In addition to the above results in the Closed division, Habana has also contributed results that show Goya’s superior throughput under latency constraints, which benefits real-time applications such as autonomous driving. These results are available in the Open division here. Our Open submission follows the Closed submission rules with only one change: stricter latency constraints for the Multi-Stream scenario.

  • ResNet-50: Goya delivers 20 Samples-Per-Query (SPQ) under a latency constraint of 2 ms and 40 SPQ under a latency constraint of 3.3 ms. The constraint Goya meets is therefore up to 25 times tighter than the latency required for the Closed division (50 ms).
  • SSD-large: Goya delivers 4 SPQ under a latency constraint of 16.8 ms and 8 SPQ under a latency constraint of 30.8 ms, up to roughly 4 times tighter than the latency required for the Closed division (66 ms).
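
The "times faster" figures in the bullets above are ratios between the Closed division latency bound and the tighter bound used in the Open submission. A minimal Python sketch of that arithmetic, using only the numbers quoted above (the function and variable names are illustrative and are not part of any MLPerf or Habana tooling):

```python
def latency_headroom(closed_bound_ms: float, open_bound_ms: float) -> float:
    """Return how many times tighter the open-submission bound is than the closed bound."""
    if open_bound_ms <= 0:
        raise ValueError("latency bound must be positive")
    return closed_bound_ms / open_bound_ms


# (model, Closed division bound in ms, Open submission bound in ms, samples per query)
RESULTS = [
    ("ResNet-50", 50.0, 2.0, 20),
    ("ResNet-50", 50.0, 3.3, 40),
    ("SSD-large", 66.0, 16.8, 4),
    ("SSD-large", 66.0, 30.8, 8),
]

for model, closed_ms, open_ms, spq in RESULTS:
    ratio = latency_headroom(closed_ms, open_ms)
    print(f"{model}: {spq} SPQ at {open_ms} ms is {ratio:.1f}x tighter than {closed_ms} ms")
```

The "up to" figures correspond to the tightest bound for each model: 50 / 2 for ResNet-50 and 66 / 16.8 for SSD-large.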

Habana is proud to have actively participated in the MLPerf working group to establish performance measures and rules, and contributed workloads, driving this important effort in the service of customers and developers of technologies and products.

Additional benchmarks are available in our Goya white paper. Additional information regarding MLPerf benchmarks can be accessed here.

The MLPerf name and logo are trademarks. See www.mlperf.org for more information.

Habana Labs Goya Delivers Inferencing on BERT

Goya outperforms T4 GPU on key NLP inference benchmark

BERT (Bidirectional Encoder Representations from Transformers) is a language representation model based on the Transformer neural architecture, introduced by Google in 2018. This approach was quickly adopted by many because it offered improved accuracy and furthered the trend of transfer learning: its bidirectional architecture allows the same pre-trained model to successfully tackle a broad set of NLP tasks. Habana Labs demonstrated that a single Goya HL-100 inference processor PCIe card delivers a record throughput of 1,527 sentences per second when running inference on the BERT-BASE model, while maintaining negligible or zero accuracy loss.

In many of today’s linguistic use cases, a vast amount of data is needed for the training process. BERT addresses this issue by enabling transfer learning on a wide variety of linguistic tasks, an approach that has proved very efficient in computer vision. The training process is split into two phases:

1) Pre-training a base model, common to a large set of tasks and use cases, wherein a general-purpose language representation is modeled, trained on an enormous amount of unannotated text in an unsupervised fashion.

2) Fine-tuning the pre-trained base model for a specific downstream task, using relatively small amounts of data. For that purpose, the model is augmented with an additional task-specific construct, to achieve state-of-the-art results for a wide variety of tasks such as question answering, text summarization, text classification and sentiment analysis. This fine-tuning phase benefits from substantially reduced training time and significantly improved accuracy, compared to training on these datasets from scratch.

There are several reasons why the BERT workload runs so effectively on the Goya architecture:

1. All of BERT’s operators are natively supported and mapped directly to the Goya hardware primitives, running without any host intervention.

2. A mixed precision implementation is deployed in order to achieve optimized performance, using Habana’s quantization tools which set the required precision per operator to maximize performance while maintaining accuracy. BERT is amenable to quantization, with either zero or negligible accuracy loss. The BERT GEMM operations are evaluated at INT16; other operations, like Layer Normalization, are done in FP32. 

3. Goya’s heterogeneous architecture is an ideal match for the BERT workload, as both engines, the GEMM engine and the Tensor Processing Cores (TPCs), are fully utilized concurrently, supporting low batch sizes at high throughput.

4. Goya’s TPC provides significant speedup when calculating BERT’s non-linear functions, such as GeLU (Gaussian Error Linear Unit).

5. Goya’s software-managed SRAM allows increased efficiency by optimizing data movement between different memory hierarchies while executing.
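
For readers unfamiliar with GeLU, the non-linear function named in item 4, the following Python sketch shows its exact form and the commonly used tanh approximation. It only illustrates the math; it is not a description of how the TPC computes GeLU, and the function names are illustrative.

```python
import math


def gelu_exact(x: float) -> float:
    """GeLU(x) = x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Widely used tanh-based approximation of GeLU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

Because each activation requires an erf or tanh evaluation rather than a simple comparison (as in ReLU), the cost of this function is noticeable when it is applied to every element of BERT's intermediate layers.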

The mixed-precision quantization yields accuracy comparable to that of the original model trained in FP32: the accuracy drop is at most 0.11% (verified on the SQuAD 1.1 and MRPC tasks).
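
To give a feel for why INT16 GEMM operands can track a floating-point reference so closely, here is a small, self-contained Python sketch of symmetric per-tensor quantization applied to one dot product of length 768 (the BERT-BASE hidden size). This is an assumption-laden illustration, not Habana's quantization tooling: the scaling scheme, the helper names, and the random test data are all hypothetical.

```python
import random

INT16_MAX = 32767


def quantize_int16(values):
    """Symmetric per-tensor quantization: return (integer codes, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / INT16_MAX
    codes = [max(-INT16_MAX, min(INT16_MAX, round(v / scale))) for v in values]
    return codes, scale


def dot_float(a, b):
    """Reference dot product in Python floats (double precision)."""
    return sum(x * y for x, y in zip(a, b))


def dot_int16(a, b):
    """Quantize both operands to INT16, accumulate as integers, then rescale."""
    qa, scale_a = quantize_int16(a)
    qb, scale_b = quantize_int16(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # Python integers do not overflow
    return acc * scale_a * scale_b


random.seed(0)
a = [random.uniform(-1.0, 1.0) for _ in range(768)]
b = [random.uniform(-1.0, 1.0) for _ in range(768)]
reference = dot_float(a, b)
approx = dot_int16(a, b)
print(f"reference={reference:.6f}  int16={approx:.6f}  abs_error={abs(reference - approx):.2e}")
```

With 16-bit codes the rounding error per element is at most half a quantization step, so the accumulated error over 768 terms stays small relative to typical activations. End-to-end model accuracy depends on many more factors than this single dot product.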

To quantify and benchmark BERT results, we used Nvidia’s demo release running a SQuAD question-answering task, in which the model identifies the answer to an input question within a given paragraph.

Model used: Dataset: SQuAD; Topology: BERT BASE, Layers = 12, Hidden Size = 768, Heads = 12, Intermediate Size = 3,072, Max Seq Len = 128.
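
The topology figures above determine a few derived quantities that are useful when reasoning about the workload. The sketch below assumes the standard Transformer encoder layout from the original BERT model (four attention projections, a two-layer feed-forward block, and two LayerNorms per layer, all with biases). Embeddings and task heads are excluded because the vocabulary size is not stated here. The class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertConfig:
    """Topology values quoted in the text for BERT BASE."""
    layers: int = 12
    hidden_size: int = 768
    heads: int = 12
    intermediate_size: int = 3072
    max_seq_len: int = 128

    @property
    def head_dim(self) -> int:
        """Width of each attention head."""
        if self.hidden_size % self.heads != 0:
            raise ValueError("hidden_size must be divisible by heads")
        return self.hidden_size // self.heads

    def encoder_layer_params(self) -> int:
        """Weights and biases in one encoder layer (assumed standard layout)."""
        h, i = self.hidden_size, self.intermediate_size
        attention = 4 * (h * h + h)               # Q, K, V and output projections
        feed_forward = (h * i + i) + (i * h + h)  # expand to i, then project back to h
        layer_norms = 2 * (2 * h)                 # two LayerNorms, each with scale and shift
        return attention + feed_forward + layer_norms

    def encoder_params(self) -> int:
        """Parameters across all encoder layers, excluding embeddings and heads."""
        return self.layers * self.encoder_layer_params()


cfg = BertConfig()
print(f"head_dim={cfg.head_dim}")
print(f"params_per_layer={cfg.encoder_layer_params():,}")
print(f"encoder_params={cfg.encoder_params():,}")
```

Under these assumptions the encoder stack alone holds roughly 85 million parameters, almost all of them in the GEMM weight matrices.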

Below are the platform configurations and results:

Goya Configuration: 

Hardware: Goya HL-100; CPU Xeon Gold 6152 at 2.1GHz
Software: Ubuntu v-16.04.4; SynapseAI v-0.2.0–1173

GPU Configuration:

Hardware: T4; CPU Xeon Gold [email protected]/16GB/4 VMs
Software: Ubuntu-18.04.2.x86_64-gnu; CUDA Ver 10.1, cudnn7.5; TensorRT-;

Both the Goya and the T4 implementations use mixed precision, combining 16-bit and FP32 data types. The Habana team is working on further optimizations, including mixed-precision data representations that use an 8-bit data type.

The Goya processor delivers 1.67x to 2.06x higher throughput than the T4 on the SQuAD task (at batch sizes 12 and 24, respectively), all at significantly lower latency. As the BERT base model is the foundation of many NLP applications, we expect similar inference speedups for other NLP applications.

To learn more about Goya performance, including other benchmark results, read the Goya whitepaper.

Can you see the GOYA vs. T4 performance difference?

Yes, this is what >3x inference throughput looks like.

While the Habana GOYA™ AI Inference processor is relatively new to AI processing, having been introduced only in September 2018, its performance is redefining what customers can expect from a processor that’s custom-designed and optimized for AI inference.

On the ResNet-50 benchmark, GOYA outpaces its closest rival, the T4 processor, by a factor of more than 3. GOYA delivers an inference throughput of 15,393 images per second, as opposed to the T4’s Nvidia-reported 4,944 images per second. As you see here, 3x makes a tangible difference: 3 times faster processing = 3 times quicker processing of deep learning workloads = a 3 times increase in productivity.

The key factors used in assessing inference performance are throughput, power efficiency, latency and the ability to support small batch sizes. In this same ResNet-50 benchmark, GOYA offered power efficiency of 149 images-per-second-per-Watt (IPS/W) vs. the T4’s 71 IPS/W. GOYA also supports a minimal latency of 1.01 ms (well below the industry requirement of 7 ms) vs. the T4’s whopping 26 ms. In addition, GOYA’s performance is linear and sustained even at small batch sizes.
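
The ratios behind these comparisons can be reproduced directly from the figures quoted above. The short Python sketch below does so. The implied board power is a derived number that assumes throughput and IPS/W were measured at the same operating point, which the text does not state; treat it as a rough illustration only. All names are illustrative.

```python
# Figures quoted in the text for the ResNet-50 benchmark.
GOYA = {"images_per_s": 15393, "images_per_s_per_w": 149, "latency_ms": 1.01}
T4 = {"images_per_s": 4944, "images_per_s_per_w": 71, "latency_ms": 26.0}

throughput_ratio = GOYA["images_per_s"] / T4["images_per_s"]
efficiency_ratio = GOYA["images_per_s_per_w"] / T4["images_per_s_per_w"]
latency_ratio = T4["latency_ms"] / GOYA["latency_ms"]


def implied_power_w(device: dict) -> float:
    """Throughput divided by IPS/W; assumes both were measured under the same conditions."""
    return device["images_per_s"] / device["images_per_s_per_w"]


print(f"throughput ratio: {throughput_ratio:.2f}x")
print(f"power-efficiency ratio: {efficiency_ratio:.2f}x")
print(f"latency ratio: {latency_ratio:.1f}x")
print(f"implied power (assumption): GOYA {implied_power_w(GOYA):.1f} W, T4 {implied_power_w(T4):.1f} W")
```

The throughput ratio comes out slightly above 3, consistent with the ">3x" headline, while the power-efficiency ratio is closer to 2.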

For more information on the GOYA AI Inference Processor, check out the whitepaper.