Introduction

Recent years have witnessed a trend of applying large-scale distributed deep learning algorithms in both business and scientific computing, with the goal of speeding up training while achieving state-of-the-art quality. The HPC community has shown great interest in building HPC AI systems dedicated to running those workloads, and HPC AI benchmarks accelerate this process. Unfortunately, benchmarking HPC AI systems at scale raises serious challenges, and none of the previous HPC AI benchmarks achieve the goals of being equivalent, relevant, representative, affordable, and repeatable.

HPC AI500 presents a comprehensive methodology, tools, Roofline performance models, and innovative metrics for benchmarking, optimizing, and ranking HPC AI systems. We abstract an HPC AI system into nine independent layers, and present explicit benchmarking rules and procedures to assure the equivalence of each layer, repeatability, and replicability. On the basis of AIBench, by far the most comprehensive AI benchmark suite, we present and build two HPC AI benchmarks drawn from business and scientific computing: Image Classification and Extreme Weather Analytics, achieving both representativeness and affordability. To rank the performance and energy efficiency of HPC AI systems, we propose Valid FLOPS and Valid FLOPS per watt, which impose a penalty on failing to achieve the target quality. We propose using convolution and GEMM, the two most intensively used kernel functions of AIBench, to measure the upper-bound performance of HPC AI systems, and present HPC AI Roofline models for guiding performance optimizations. The evaluations show that our methodology, benchmarks, performance models, and metrics can measure, optimize, and rank HPC AI systems in a scalable, simple, and affordable way.
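
As a rough illustration of the Valid FLOPS idea (not the normative definition, which is given in the HPC AI500 specification), the sketch below applies a quality-based penalty of the assumed form (achieved_quality / target_quality)^n to the measured FLOPS; the exponent n, the capping at the target, and the quality metric itself are all assumptions here.

    # Sketch only: penalty form, exponent, and capping are assumptions; the
    # normative definition is in the HPC AI500 specification.
    def valid_flops(measured_flops, achieved_quality, target_quality, penalty_exponent=5):
        """Penalize runs that fail to reach the target quality."""
        ratio = min(achieved_quality / target_quality, 1.0)
        return measured_flops * (ratio ** penalty_exponent)

    def valid_flops_per_watt(measured_flops, achieved_quality, target_quality, avg_power_watts):
        """Energy-efficiency counterpart of the same penalized metric."""
        return valid_flops(measured_flops, achieved_quality, target_quality) / avg_power_watts

Under this illustrative penalty, a run that narrowly misses the target accuracy keeps only part of its measured FLOPS, while a run that reaches the target keeps all of it.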

Methodology

The goal of the HPC AI500 methodology is to be equivalent, relevant, representative, affordable, and repeatable.

Equivalence

To perform fair benchmarking across different systems, or across the same system at different scales, we propose two approaches to assure equivalence.

First, as shown in Figure 1, we abstract the system under test into nine independent layers, and put each layer under test while keeping the other layers intact unless otherwise stated.

Layer 1 is the hardware, including CPUs and networks. Layers 2 and 3 are the related system software: the operating system (Layer 2) and the communication libraries (Layer 3). Layer 4 is the AI accelerator, e.g., GPUs, and its libraries, e.g., CUDA and cuDNN. Layer 5 is the AI framework, such as TensorFlow and PyTorch. Layer 6 refers to the programming model, including the parallel mode (data parallelism or model parallelism) and synchronous or asynchronous training. Layer 7 refers to the workloads used in the HPC AI500 V2.0 benchmark. Layer 8 refers to hyper-parameter policies and settings. Layer 9 refers to the problem domain, including datasets, target quality, and epochs.
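
For concreteness, the layer abstraction can be written down as a simple declaration of what each layer contains; the layer names below follow Figure 1, while the concrete example values (Linux, MPI, NCCL, and so on) are only illustrative and not required by the methodology.

    # Illustrative only: a declarative view of the nine layers of the system under test.
    # The example values are placeholders, not requirements of the benchmark.
    NINE_LAYERS = {
        1: ("hardware", "CPUs, networks"),
        2: ("operating system", "Linux"),
        3: ("communication libraries", "MPI, NCCL"),
        4: ("AI accelerator and libraries", "GPU, CUDA, cuDNN"),
        5: ("AI framework", "TensorFlow, PyTorch"),
        6: ("programming model", "data parallelism, synchronous training"),
        7: ("workload", "Image Classification, Extreme Weather Analytics"),
        8: ("hyper-parameters", "learning rate policy, batch size"),
        9: ("problem domain", "dataset, target quality, epochs"),
    }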

Second, for the sake of simplicity, we propose three high levels of benchmarking, each putting several related layers together under test.

(1) The hardware level. This level is for benchmarking HPC AI hardware systems and their related system software (Layers 1, 2, 3, 4). In this context, the other layers should be kept intact unless otherwise stated in the benchmarking rules. The benchmark users should compile the source code of the benchmark implementation, provided by the benchmark committee, directly on their hardware with the allowed changes. Luo et al. [43] show that the same model on different frameworks achieves different accuracy, so in addition to the same data set and AI model, we mandate that the benchmark implementations also use the same AI framework. The benchmark users can change the hardware, OS, compiler settings, and communication libraries. For the other layers, the benchmark users can only change parallel modes in Layer 6 or tune learning rate policies and batch size settings in Layer 8. It is the benchmark committee's duty to assure the equivalence of Layers 6, 7, 8, and 9 across different benchmark implementations upon the users' requests.

(2) The system level. Because of the portability cost, some benchmark users may opt for one specific AI framework without supporting the others, so specifying a fixed framework has limited purpose. At the system level, we therefore put the hardware system together with the AI framework under test (Layers 1, 2, 3, 4, and 5). We mandate that the benchmark implementations use the same data set and AI model. In addition to the changes allowed at the hardware level, the users are allowed to re-implement the algorithms on a different or even customized AI framework (Layer 5). The other layers should be kept intact unless otherwise stated in the benchmarking rules. The benchmark committee or an independent group needs to double-check the equivalence of Layers 6, 7, 8, and 9 between the two benchmark implementations.

(3) The free level. At this level, the specification of an AI task is stated in a paper-and-pencil manner, separate from any specific implementation. That is to say, the same data set, target quality, and training epochs are defined in Layer 9, while the other layers are open for optimization. The emphasis is on advancing the state of the art of software and hardware co-design, so the benchmark users can change any layers from Layer 1 to Layer 8 while keeping Layer 9 intact. Meanwhile, the benchmark users are encouraged to disclose the details of their optimizations.

Figure 1: The equivalent perspective of HPC AI500 V2.0 Methodology.

Representativeness

We choose AIBench Training, by far the most comprehensive AI benchmark suite, as the starting point for the design and implementation of the HPC AI benchmarks. The experimental results of AIBench Training have demonstrated that its seventeen AI tasks are diverse in terms of model complexity, computational cost, convergence rate, and microarchitecture characteristics, covering the most typical AI scenarios. To achieve representativeness, we identify the most typical workloads in AIBench Training from both microarchitecture-independent and microarchitecture-dependent perspectives. From the microarchitecture-dependent perspective, we choose five micro-architectural metrics to profile the computation and memory access patterns of AIBench Training: achieved occupancy, ipc efficiency, gld efficiency, gst efficiency, and dram utilization. A GPU contains multiple streaming multiprocessors (SMs); each SM has a certain number of CUDA cores, registers, caches, warp schedulers, etc. Achieved occupancy represents the ratio of the average active warps per active cycle to the maximum number of warps supported by a multiprocessor. Ipc efficiency indicates the ratio of the executed instructions per cycle to the theoretical maximum. Gld efficiency and gst efficiency represent the ratio of requested global memory load/store throughput to required global memory load/store throughput, respectively. Dram utilization indicates the utilization level of the device memory relative to its peak.
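
As a sketch of how such per-kernel metrics can be collected, the snippet below builds an nvprof command line for the five counters; the metric names follow nvprof's vocabulary (nvprof reports ipc directly, and the "ipc efficiency" above is its ratio to an architecture-dependent theoretical maximum), while the training entry point "train.py" and the peak IPC constant are assumptions for illustration. On recent GPUs the equivalent counters are exposed through Nsight Compute instead.

    import subprocess

    # Assumption: nvprof is available and the workload is launched via "python train.py".
    METRICS = ["achieved_occupancy", "ipc", "gld_efficiency", "gst_efficiency", "dram_utilization"]

    cmd = [
        "nvprof",
        "--metrics", ",".join(METRICS),
        "--csv", "--log-file", "profile.csv",
        "python", "train.py",      # hypothetical training entry point
    ]
    subprocess.run(cmd, check=True)

    # "ipc efficiency" as used above: measured ipc divided by the theoretical maximum
    # instructions per cycle of the SM (an assumed, architecture-dependent constant).
    THEORETICAL_IPC = 4.0
    def ipc_efficiency(measured_ipc):
        return measured_ipc / THEORETICAL_IPC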

We profile the above five metrics in our GPU experiments and perform K-means clustering on all seventeen benchmarks to explore their similarities. We further use t-SNE, a dimension-reduction technique that embeds high-dimensional data in a low-dimensional space, for visualization. Fig. 2a shows the result: the x-axis and y-axis give the positions in the Euclidean space after applying t-SNE to the above five metrics. We find that these seventeen benchmarks cluster into three classes.
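
A minimal sketch of this clustering and visualization step, assuming the five per-workload metrics have already been averaged into a 17 x 5 matrix; the random matrix below is a placeholder for the measured values, and the specific K-means and t-SNE parameters are illustrative choices, not the ones used in our experiments.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE

    # Placeholder: 17 workloads x 5 micro-architectural metrics (achieved occupancy,
    # ipc efficiency, gld efficiency, gst efficiency, dram utilization).
    rng = np.random.default_rng(0)
    features = rng.random((17, 5))

    X = StandardScaler().fit_transform(features)

    # Cluster the workloads into three classes, as in Fig. 2a.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Embed into 2-D with t-SNE purely for visualization; the clustering itself
    # operates on the original five metrics.
    embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

    for workload_id, (x, y) in enumerate(embedding, start=1):
        print(f"DC-AI-C{workload_id}: cluster {labels[workload_id - 1]}, position ({x:.2f}, {y:.2f})")

The same recipe applies to the microarchitecture-independent analysis below, with parameter size, FLOPs, and epochs-to-quality as the feature columns.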

From the microarchitecture-independent perspective, we analyze the algorithm behaviors, including model complexity (parameter size) and convergence rate (epochs to achieve the state-of-the-art quality), and the system-level behaviors, including computational cost (FLOPs), for all seventeen workloads in AIBench Training. Further, we conduct a clustering analysis using these microarchitecture-independent performance data as input. Fig. 2b shows the clustering result.

Combining Fig. 2a and Fig. 2b, we conclude that the AIBench Training workloads consistently cluster into three classes using both the microarchitecture-dependent and microarchitecture-independent approaches.

Figure 2a: The microarchitecture-dependent clustering.

Figure 2b: The microarchitecture-independent clustering.

Repeatability

Repeatability refers to the variation in repeated measurements (different runs, rather than different epochs, using the same benchmark implementation under identical configurations) made on the same system under test. A good benchmark must be repeatable; thus, repeatability is another critical criterion for selecting workloads for the HPC AI500 benchmarks. However, AI is stochastic by nature, due to random seeds, random data traversal, the non-commutative nature of floating-point addition, and so on, which is hard to avoid. Thus, most AI benchmarks exhibit run-to-run variation, even when using the same benchmark implementation on the same system. Therefore, we ensure repeatability by choosing relatively stable workloads among the various AI tasks. We perform a repeatability analysis using all workloads of AIBench Training, as shown in Table 1.
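
The run-to-run variation reported in Table 1 can be read as the dispersion of an end-to-end measurement (for example, epochs or time to reach the target quality) across repeated runs. A minimal sketch, assuming the coefficient of variation (standard deviation over mean) as the statistic; the run values below are placeholders, not measured data.

    import statistics

    def run_to_run_variation(measurements):
        """Coefficient of variation (stdev / mean) across repeated runs, as a percentage."""
        return statistics.stdev(measurements) / statistics.mean(measurements) * 100

    # Placeholder: epochs-to-target-quality observed in five independent runs.
    image_classification_runs = [63, 64, 63, 64, 63]
    print(f"{run_to_run_variation(image_classification_runs):.2f}%")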

Table 1: Run-to-run Variation of Seventeen Benchmarks of AIBench Training. Note that Image-to-Image and Image Generation variations are not reported due to the lack of a widely accepted metric to terminate an entire training session.
No. Component Benchmark Variation Runs
DC-AI-C1 Image Classification 1.12% 5
DC-AI-C2 Image Generation Not available Not available
DC-AI-C3 Text-to-Text Translation 9.38% 6
DC-AI-C4 Image-to-Text 23.52% 5
DC-AI-C5 Image-to-Image Not available Not available
DC-AI-C6 Speech Recognition 12.08% 4
DC-AI-C7 Face Embedding 5.73% 8
DC-AI-C8 3D Face Recognition 38.46% 4
DC-AI-C9 Object Detection 0.00% 10
DC-AI-C10 Recommendation 9.95% 5
DC-AI-C11 Video Prediction 11.83% 4
DC-AI-C12 Image Compression 22.49% 4
DC-AI-C13 3D Object Reconstruction 16.07% 4
DC-AI-C14 Text Summarization 24.72% 5
DC-AI-C15 Spatial Transformer 7.29% 4
DC-AI-C16 Learning to Rank 1.90% 4
DC-AI-C17 Neural Architecture Search 6.15% 6

Keep the Benchmarks Simple

Simplicity is another important criterion for benchmarking. However, benchmarking an entire training session of all seventeen workloads in AIBench Training is extremely expensive. We emphasize that Image Classification, Object Detection, and Learning to Rank achieve not only representativeness and repeatability but also simplicity.

The Requirements in HPC Field

Dataset: Compared with AI benchmarks for other domains, HPC AI benchmarking has two unique differences. First, the challenges of HPC AI benchmarking are inherited from the complexity of benchmarking scalable hardware and software systems at scale, i.e., tens of thousands of nodes, which differs significantly from IoT or datacenter benchmarking. On this point, we need to make the benchmark as simple as possible, as discussed in detail above. Second, HPC AI domains cover both commercial and high-performance scientific computing. Currently, business applications are pervasive. Because of the difficulty of recruiting qualified scientists to label scientific data, AI for science applications lag behind but are promising. In general, scientific data are often more complicated than MNIST or ImageNet data: the shape of scientific data can be 2D images or higher-dimensional structures with hundreds of channels, while popular commercial image data like ImageNet often consist of only three RGB channels. So we should include scientific data in the HPC AI benchmarks.
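
To make the dimensional difference concrete, the sketch below contrasts an RGB ImageNet-style sample with a multi-channel scientific field; the spatial sizes and the 16-channel count for the climate-simulation snapshot are assumptions chosen for illustration, not normative values from the specification.

    import numpy as np

    # Commercial image data: three RGB channels (spatial size is illustrative).
    imagenet_sample = np.zeros((224, 224, 3), dtype=np.float32)

    # Scientific data: many physical variables per grid point (e.g., pressure,
    # temperature, humidity). The 768 x 1152 x 16 shape is an assumed example of a
    # climate-simulation snapshot, not a normative value.
    extreme_weather_sample = np.zeros((768, 1152, 16), dtype=np.float32)

    print(imagenet_sample.shape, extreme_weather_sample.shape)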

Computation complexity: A benchmark with a small amount of computation cannot fully utilize the performance of an HPC AI system. Therefore, we exclude Learning to Rank because it has the lowest computational complexity in terms of FLOPs: a single forward pass takes only 0.08 MFLOPs. Image Classification and Object Detection are more complex than that by one or two orders of magnitude, respectively.
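
As a rough illustration of why per-sample FLOPs differ by orders of magnitude across tasks, the sketch below estimates the forward cost of a single convolution layer using the standard 2*K*K*Cin*Cout*H*W multiply-add count and compares it with one fully connected layer of a small ranking model; the layer dimensions are assumptions for illustration, not the actual model configurations used in AIBench.

    def conv2d_flops(h_out, w_out, c_in, c_out, kernel):
        """Forward FLOPs of one convolution layer (a multiply-add counted as 2 FLOPs)."""
        return 2 * kernel * kernel * c_in * c_out * h_out * w_out

    def dense_flops(n_in, n_out):
        """Forward FLOPs of one fully connected layer."""
        return 2 * n_in * n_out

    # Illustrative shapes only: a mid-network convolution of an image model versus
    # a small ranking MLP layer.
    print(f"one conv layer:  {conv2d_flops(56, 56, 64, 64, 3) / 1e6:.1f} MFLOPs")
    print(f"one dense layer: {dense_flops(200, 100) / 1e6:.3f} MFLOPs")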

The Final Benchmark Decision

Based on the above analysis, we conclude that Image Classification and Object Detection are the final candidates for constructing the HPC AI500 benchmark. We investigate the broad applications of Image Classification and Object Detection in both the HPC and commercial fields, and choose the most representative workloads and data sets from these two fields. Extreme Weather Analytics (EWA) is one of the pioneering works that uses a deep learning algorithm to replace rules predefined by human experts, and it achieves excellent results. Most importantly, EWA's goal is to identify various extreme weather patterns (e.g., tropical depressions), which is essentially object detection. In 2018, a deep learning-based EWA implementation won the Gordon Bell Prize, the first AI application to win this award. Image Classification is widely used in many commercial applications and is a fundamental task in AI research. With the development of large-scale deep learning, Image Classification has become a well-known showcase for optimizing HPC AI systems.

Specification

The HPC AI500 V2.0 specification and associated metrics are described in the Specification section.

Numbers

See HPC AI500 Ranking.

Contributors

Prof. Jianfeng Zhan, ICT, Chinese Academy of Sciences, and BenchCouncil    
Zihan Jiang, ICT, Chinese Academy of Sciences
Dr Wanling Gao, ICT, Chinese Academy of Sciences
Dr Lei Wang, ICT, Chinese Academy of Sciences
Xingwang Xiong, ICT, Chinese Academy of Sciences
Yuchen Zhang, State University of New York at Buffalo
Xu Wen, ICT, Chinese Academy of Sciences
Chunjie Luo, ICT, Chinese Academy of Sciences
Hainan Ye, BenchCouncil and Beijing Academy of Frontier Sciences and Technology
Xiaoyi Lu, The Ohio State University
Yunquan Zhang, National Supercomputing Center in Jinan, China
Shengzhong Feng, National Supercomputing Center in Shenzhen, China
Kenli Li, National Supercomputing Center in Changsha, China
Weijia Xu, Texas Advanced Computing Center, The University of Texas at Austin

Support

License

AIBench is available for researchers interested in AI. Software components of AIBench are all available as open-source software and governed by their own licensing terms; researchers intending to use AIBench are required to fully understand and abide by the licensing terms of the various components. AIBench itself is open source under the Apache License, Version 2.0; please use all files in compliance with the License. For software developed externally (not by the AIBench group):

  • Redistribution of source code must comply with the license and notice disclaimers.
  • Redistribution in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimers in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.