Summary

Recent AI advancements have brought breakthroughs in processing images, video, speech, and audio, boosting industry-scale deployments of massive AI algorithms, systems, and architectures. Unfortunately, AI training's learning dynamics are volatile and unpredictable: a slight change in models, hyper-parameters, or optimization strategies may affect the final accuracy or the training convergence rate, and this behavior is not well understood theoretically. Meanwhile, benchmarks have a short shelf-life because state-of-the-art AI models evolve very fast. These situations raise severe AI benchmarking challenges.

First, the prohibitive cost of training a state-of-the-art AI model raises a serious benchmarking challenge. Some mixed-precision optimizations indeed improve traditional performance metrics like throughput, yet they adversely affect the final model's quality, which we can only observe by running an entire training session, i.e., training an AI model (a component benchmark) until it achieves a state-of-the-art quality target. So running an entire training session is mandatory. Unfortunately, it is prohibitively costly, often taking several weeks on a small-scale system. Furthermore, the architecture community relies heavily upon simulations, with slowdowns varying wildly from 10X to 1000X, which further exacerbates the challenge.
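
For illustration, a minimal sketch of such a time-to-quality measurement is shown below, assuming PyTorch; `build_model`, `train_loader`, `val_loader`, `evaluate`, and the quality target are hypothetical placeholders, not AIBench code. Mixed precision shortens each step, but the run only terminates once the target quality is reached, so the full session must be executed.

```python
# Minimal sketch (not AIBench code): time-to-quality measurement with
# mixed-precision training in PyTorch. build_model, train_loader,
# val_loader, and evaluate are hypothetical placeholders.
import time
import torch

def train_to_quality(build_model, train_loader, val_loader, evaluate,
                     target_accuracy=0.76, max_epochs=90, lr=0.1):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = build_model().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()          # mixed-precision loss scaling
    start = time.time()

    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():       # FP16/FP32 mixed precision
                loss = criterion(model(images), labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        accuracy = evaluate(model, val_loader)    # quality metric, e.g. Top-1
        if accuracy >= target_accuracy:           # time-to-quality reached
            return time.time() - start, epoch + 1

    return None, max_epochs                       # target never reached
```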

Second, there are conflicting benchmarking requirements across different stages: affordability vs. comprehensiveness. On the one hand, earlier-stage evaluations of a new architecture or system need affordable AI benchmarks to reduce the porting cost. Affordable benchmarks are also necessary to provide valuable performance implications when ranking off-the-shelf systems or architectures, which promotes their adoption.

On the other hand, later-stage evaluations, or purchasing off-the-shelf systems, need detailed evaluations using comprehensive benchmarks to avoid benchmarketing; using only a few AI component benchmarks like MLPerf may lead to misleading or unfair conclusions in the other stages. For example, our experiments find that TPU achieves very high performance for Image Classification, while it officially supports only a limited set of models because of the huge porting cost, which is not the case for GPUs. Meanwhile, the initial design inputs or workload characterization need to consider various computation and memory access patterns to avoid over-optimization for some specific workloads.

There are further shelf-life, scalability, and repeatability challenges. AIBench is a systematic AI benchmarking project that tackles the challenges mentioned above. It distills and abstracts real-world application scenarios across Datacenter, HPC, IoT, and Edge into scenario, training, inference, micro, and synthetic benchmarks. This page describes AIBench Training and its subsets.

We present a balanced methodology to meet these conflicting benchmarking requirements in different stages. We use real-world benchmarks to cover, as broadly as possible, the factor space that impacts learning dynamics. The factors we consider include the commonly used building blocks, model layers, loss functions, optimizers, FLOPs, and different-scale parameter sizes. We identify and include nineteen representative AI tasks from one of the most essential domains, Internet services, to guarantee the benchmarks' representativeness and diversity.

We keep two subsets, one for repeatable performance ranking (the RPR subset) and one for workload characterization (the WC subset), to a minimum for affordability. The criteria for the RPR subset are the diversity of model complexity, computational cost, and convergence rate, plus repeatability and the availability of widely accepted metrics. The criteria for the WC subset are based on micro-architectural characteristics; we currently consider occupancy, IPC, global load, global store, and DRAM utilization, which are GPU architecture-dependent. Table 1 summarizes the differences of AIBench Training v1.1 against MLPerf Training v0.7.
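
These counters correspond to standard NVIDIA profiler metrics. As a rough illustration (not the AIBench profiling tooling), the sketch below assumes nvprof is available and that `train.py` is a hypothetical benchmark entry point; metric names can vary across GPU generations and profilers.

```python
# Minimal sketch (not AIBench tooling): collecting the GPU counters named
# above with nvprof. Metric names follow nvprof conventions and may differ
# across GPU generations; train.py is a hypothetical benchmark entry point.
import subprocess

METRICS = [
    "achieved_occupancy",   # occupancy
    "ipc",                  # instructions per cycle
    "gld_throughput",       # global load throughput
    "gst_throughput",       # global store throughput
    "dram_utilization",     # DRAM utilization level
]

def profile_benchmark(script="train.py", log_file="nvprof_metrics.csv"):
    """Run one benchmark under nvprof and dump per-kernel metrics to a CSV."""
    cmd = [
        "nvprof",
        "--metrics", ",".join(METRICS),
        "--csv",
        "--log-file", log_file,
        "python", script,
    ]
    subprocess.run(cmd, check=True)
    return log_file

if __name__ == "__main__":
    profile_benchmark()
```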

The Methodology

AIBench Training adopts a balanced AI benchmarking methodology considering comprehensiveness, representativeness, affordability, and portability. Our methodology widely investigates AI tasks and models and covers the algorithm-level, system-level, and microarchitecture-level factor space as broadly as possible. At the algorithm level, we consider the commonly used building blocks, model layers, loss functions, optimizers, FLOPs, and different-scale parameter sizes. At the system level, we evaluate the convergence rate and hot functions. At the microarchitecture level, we identify the workloads' computation and memory access patterns. We provide real-world AI workloads to achieve comprehensiveness and representativeness. Besides, we propose two AIBench Training subsets, the RPR and WC subsets, to achieve affordability. After profiling, we provide the hotspot functions as microbenchmarks (AIBench Micro) to achieve portability for simulator-based research.
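
As a concrete illustration of one algorithm-level factor, the minimal sketch below (illustrative only; it uses off-the-shelf torchvision models rather than the AIBench implementations) compares candidate models by learnable parameter count; per-layer FLOPs can be tallied in a similar way.

```python
# Minimal sketch (illustrative only): comparing the parameter scales of
# candidate models, one of the algorithm-level factors considered above.
import torch
import torchvision.models as models

def parameter_count(model: torch.nn.Module) -> int:
    """Total number of learnable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

candidates = {
    "resnet50": models.resnet50(),
    "mobilenet_v2": models.mobilenet_v2(),
    "vgg16": models.vgg16(),
}

for name, model in candidates.items():
    print(f"{name}: {parameter_count(model) / 1e6:.1f}M parameters")
```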

  • Performing a detailed survey of one critical domain rather than a rough survey of a variety of domains

    As it is impossible to investigate all AI domains, we single out one of the most essential AI domains, Internet services, for a detailed survey. In cooperation with seventeen prominent industry partners, our survey covers diverse business types, including Internet services, streaming video, machine translation, Q&A communities, online payment, etc.

  • Include as many representative benchmarks as possible

    We believe the prohibitive cost of training a model to state-of-the-art quality cannot justify including only a few AI benchmarks. Instead, using only a few AI component benchmarks may lead to error-prone design: over-optimization for some specific workloads, or benchmarketing.

    For Internet services, to the best of our knowledge, we identify and include as many representative AI tasks, models, and datasets as possible into the benchmark suite to guarantee the benchmarks' representativeness and diversity. Meanwhile, we consider the diversity of primitives in the AI models.

    Past successful benchmarking practice also supports this strategy. The execution time of other benchmark suites, like HPC and SPEC CPU on simulators, is also prohibitively costly. However, the representativeness and coverage of a widely accepted benchmark suite are of paramount importance. For example, SPEC CPU 2017 contains 43 benchmarks. Other examples include PARSEC 3.0 (30) and TPC-DS (99).

  • Keep the benchmark subsets to a minimum

    For two different purposes, we choose two minimum subsets according to different criteria: we choose the RPR subset based on the diversity of model complexity, computational cost, and convergence rate, plus repeatability and the availability of widely accepted metrics, and we choose the WC subset according to the representativeness of system or micro-architectural characteristics.

    Using a subset for ranking is also supported by past practice. For example, Top500, a supercomputer ranking, only reports HPL and HPCG, two benchmarks out of 20+ representative HPC benchmarks such as HPCC and NPB.

  • Consider the full benchmarks, their subsets, and microbenchmarks as indispensable

    Different stages have conflicting benchmarking requirements. The initial design inputs to a new system/architecture need comprehensive workload characterization. For earlier-stage evaluations of a new system or architecture, which may even rely on simulation-based methods, heavy benchmarking is a significant burden; thus, concise, portable, and lightweight benchmarks are of great significance. In contrast, later-stage evaluations of a new architecture or system, or purchasing a commercial off-the-shelf one, need detailed evaluations using comprehensive benchmarks to avoid error-prone design or benchmarketing.

    For initial design inputs, we perform detailed workload characterization. For later-stage evaluations or purchasing of a new system/architecture, we run the full benchmarks, or selectively run some benchmarks to locate the bottlenecks quickly. We run an affordable subset for earlier-stage evaluations of a new system/architecture or for ranking commercial off-the-shelf systems/architectures.

Nineteen Representative AI Tasks

To cover a broad spectrum of representative AI tasks in Internet services, we thoroughly analyze the essential application scenarios among three primary Internet services, including search engines, social networks, and e-commerce, as shown in Table 2.

In total, we identify nineteen representative AI tasks, and for each of them we implement a state-of-the-art model as a component benchmark, as shown in Table 3. Our benchmarks are constantly evolving and are updated to use state-of-the-art models. For detailed descriptions of these tasks, please refer to the AIBench Specification.

AIBench subsets

For performance ranking and workload characterization purposes, we choose two subsets.

How to Choose the Two Subsets?
For the repeatable performance ranking purpose, we need to keep the subset repeatable, fair, affordable, and representative, so we keep it to a minimum according to the following perspectives and criteria.

  • Reflecting diverse model complexity, computational cost, and convergence rate. Specifically, we intend to choose benchmarks that cover these different aspects as much as possible.
  • Run-to-run variation. Repeatability is an important selection criterion for the subset. To avoid too much run-to-run variation, we choose benchmarks with a variance under 2% (a minimal variance-check sketch follows this list).
  • Widely accepted evaluation metrics. A benchmark should have widely accepted performance metrics so that runs from different users have consistent termination conditions. Hence, we exclude the GAN-based models, even though GANs are particularly important for content generation.
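
As a minimal sketch of such a variance check (with hypothetical measurements, not actual AIBench results), the coefficient of variation of time-to-quality across repeated runs can be compared against the 2% threshold:

```python
# Minimal sketch (hypothetical data): screening benchmarks by run-to-run
# variation of time-to-quality, keeping those under the 2% threshold.
import statistics

# Hypothetical time-to-quality measurements (minutes) from repeated runs.
runs = {
    "Benchmark_A": [612.0, 605.2, 618.9, 609.4, 611.1],
    "Benchmark_B": [88.0, 97.5, 81.2, 104.0, 92.3],
}

THRESHOLD = 0.02  # 2% run-to-run variation

def variation(times):
    """Coefficient of variation: standard deviation relative to the mean."""
    return statistics.stdev(times) / statistics.mean(times)

for name, times in runs.items():
    v = variation(times)
    verdict = "keep" if v <= THRESHOLD else "exclude"
    print(f"{name}: variation = {v:.2%} -> {verdict}")
```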

For the workload characterization purpose, we aim to select the minimum set of workloads with the most representative system or micro-architectural characteristics, while repeatability and consistent termination conditions are not mandatory.

The Subsets Decision
AIBench Training provides two subsets for repeatable performance ranking (RPR subset) and workload characterization (WC subset) to improve affordability.

  • RPR subset. The RPR subset includes three benchmarks: Image Classification, Object Detection, and Learning-to-Rank. To satisfy the first criterion, they cover different ranges of FLOPs and learnable parameter counts (both small for Learning-to-Rank, medium for Image Classification, and large for Object Detection) and different convergence rates (few epochs for Object Detection, a medium number for Learning-to-Rank, and many for Image Classification). As for the second criterion, the three benchmarks have low run-to-run variation: 1.12% for Image Classification, 1.9% for Learning-to-Rank, and 0% for Object Detection. Besides, they have widely accepted evaluation metrics.
  • WC subset. The WC subset also includes three benchmarks: Spatial Transformer, Image-to-Text, and Speech-to-Text. This subset reflects the most representative micro-architectural workload characteristics, since these three benchmarks are the nearest to the centroids of the three clusters, respectively, according to the K-means clustering results on the AIBench Training benchmarks (a minimal clustering sketch follows below).
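
As a rough illustration of this selection procedure (using scikit-learn and hypothetical feature values, not the actual AIBench characterization data), the sketch below clusters benchmarks by micro-architectural metrics and keeps the benchmark nearest to each centroid:

```python
# Minimal sketch (hypothetical feature values): cluster benchmarks by
# micro-architectural metrics and pick the benchmark nearest each centroid.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-benchmark feature vectors:
# [achieved_occupancy, ipc, gld_throughput, gst_throughput, dram_utilization]
features = {
    "Benchmark_A": [0.62, 1.8, 310.0, 150.0, 6.0],
    "Benchmark_B": [0.35, 0.9, 120.0,  60.0, 3.0],
    "Benchmark_C": [0.58, 1.7, 295.0, 140.0, 6.0],
    "Benchmark_D": [0.80, 2.4, 420.0, 210.0, 8.0],
    "Benchmark_E": [0.33, 1.0, 110.0,  55.0, 2.0],
    "Benchmark_F": [0.78, 2.3, 400.0, 205.0, 8.0],
}

names = list(features)
X = StandardScaler().fit_transform(np.array([features[n] for n in names]))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# For each cluster, select the benchmark closest to the centroid.
subset = []
for k, centroid in enumerate(kmeans.cluster_centers_):
    members = [i for i, label in enumerate(kmeans.labels_) if label == k]
    closest = min(members, key=lambda i: np.linalg.norm(X[i] - centroid))
    subset.append(names[closest])

print("WC-style subset:", subset)
```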

Contributors

Prof. Jianfeng Zhan, ICT, Chinese Academy of Sciences, and BenchCouncil    
Dr. Wanling Gao, ICT, Chinese Academy of Sciences    
Fei Tang, ICT, Chinese Academy of Sciences    
Dr. Lei Wang, ICT, Chinese Academy of Sciences    
Xu Wen, ICT, Chinese Academy of Sciences    
Chuanxin Lan, ICT, Chinese Academy of Sciences    
Chunjie Luo, ICT, Chinese Academy of Sciences
Yunyou Huang, ICT, Chinese Academy of Sciences
Dr. Chen Zheng, ICT, Chinese Academy of Sciences, and BenchCouncil    
Dr. Zheng Cao, Alibaba     
Hainan Ye, Beijing Academy of Frontier Sciences and BenchCouncil     
Jiahui Dai, Beijing Academy of Frontier Sciences and BenchCouncil     
Daoyi Zheng, Baidu     
Haoning Tang, Tencent     
Kunlin Zhan, 58.com     
Biao Wang, NetEase     
Defei Kong, ByteDance     
Tong Wu, China National Institute of Metrology     
Minghe Yu, Zhihu     
Chongkang Tan, Lenovo     
Huan Li, Paypal     
Dr. Xinhui Tian, Moqi     
Yatao Li, Microsoft Research Asia     
Junchao Shao, JD.com     
Zhenyu Wang, CloudTa     
Xiaoyu Wang, Intellifusion     

Ranking

AIBench results are released.

License

AIBench is available for researchers interested in AI. The software components of AIBench are all available as open-source software and governed by their own licensing terms. Researchers intending to use AIBench are required to fully understand and abide by the licensing terms of the various components. AIBench itself is open-source under the Apache License, Version 2.0; please use all files in compliance with the License. For software developed externally (not by the AIBench group):

  • Redistribution of source code must comply with the license and notice disclaimers.
  • Redistribution in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimers in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.