Home

News: Bench19 Call for Benchmarks (Deadline Extended to August 30, Denver, US). 2019 AI Competitions (500K RMB prize!), New papers on AIBench (TR, Bench18), HPC AI500, AIoT Bench, Edge AIBench. AI algorithm and system testbed online!

Summary

Today’s Internet services are undergoing fundamental changes and shifting to an intelligent computing era, where AI is widely employed to augment services. In this context, many innovative AI algorithms, systems, and architectures have been proposed, and benchmarking and evaluating them is increasingly important. However, modern Internet services adopt a microservice-based architecture and consist of diverse modules. The diversity of these modules and the complexity of their execution paths, the massive scale and complex hierarchy of datacenter infrastructure, and the confidentiality of data sets and workloads pose great challenges to benchmarking.

First, real-world data sets and workloads from Internet services are treated as first-class confidential assets by their providers, and remain isolated between academia and industry, or even among different providers. As a result, only a few publicly available performance models or observed insights about industry-scale Internet services can be leveraged for further research. Since no industry-scale Internet service benchmark is publicly available, the state-of-the-art and state-of-the-practice are advanced only by research staff inside Internet service providers, which is not sustainable and poses a huge obstacle for our communities in developing an open and mature research field.

Second, AI has infiltrated almost every aspect of Internet services, ranging from offline analytics to online serving. Thus, to cover the critical paths and characterize the prominent characteristics of a realistic AI scenario, end-to-end application benchmarks should be provided [8, 9]. Meanwhile, there are many classes of Internet services. Modern Internet service workloads expand and change very fast, and it is not scalable, or even possible, to create a new benchmark or proxy for every possible workload. Moreover, data sets have great impacts on system and microarchitectural characteristics, so diverse data inputs should be considered. We therefore need to identify representative data sets, abstract the prominent AI problem domains (component benchmarks), and further understand the most intensive units of computation (micro benchmarks), on the basis of which we can build a concise and comprehensive AI benchmark framework.

Last but not least, from an architectural perspective, porting a full-scale AI application to a new architecture at an early stage is difficult, or even impossible, while using micro or component benchmarks alone is insufficient to discover the time breakdown of different modules and locate the bottleneck within a realistic AI application scenario at a later stage. Hence, a realistic AI benchmark suite should be able to run not only collectively, as a whole end-to-end application that reveals the time breakdown of different modules, but also individually, as micro or component benchmarks for fine-tuning hot-spot functions or kernels. An industry-standard Internet service AI benchmark suite consisting of a full spectrum of micro and component benchmarks and an end-to-end application benchmark is thus of great significance for bridging this huge gap.

AIBench is the first industry-scale AI benchmark suite, developed jointly with seventeen industry partners. First, we present a highly extensible, configurable, and flexible benchmark framework containing multiple loosely coupled modules: data input, prominent AI problem domains, online inference, offline training, and automatic deployment tools. We analyze typical AI application scenarios from the three most important Internet service domains, namely search engine, social network, and e-commerce, and then abstract and identify sixteen prominent AI problem domains: classification, image generation, text-to-text translation, image-to-text, image-to-image, speech-to-text, face embedding, 3D face recognition, object detection, video prediction, image compression, recommendation, 3D object reconstruction, text summarization, spatial transformer, and learning to rank. We implement sixteen component benchmarks for those AI problem domains, and further profile and implement twelve fundamental units of computation across the component benchmarks as micro benchmarks. On the basis of the AIBench framework, we design and implement the first end-to-end Internet service AI benchmark, built on an underlying e-commerce searching business model. As a whole, it covers the major modules and critical paths of an industry-scale e-commerce provider. The application benchmark reuses ten component benchmarks from the AIBench framework, receives query requests, and performs personalized searching, recommendation, and advertising, integrated with AI inference and training. The data maintains real-world characteristics through anonymization. Data generators are also provided to generate data at a specified scale, using several configurable parameters.

AIBench Framework

The AIBench framework provides a universal, flexible, and configurable AI benchmark framework, as shown in Fig. 1. It offers loosely coupled modules that can be easily configured and extended to compose an end-to-end application: the data input, AI problem domain, online inference, offline training, and deployment tool modules.
The data input module is responsible for feeding data into the other modules. It collects representative real-world data sets, not only from authoritative public websites but also from our industry partners after anonymization. The data schema is designed to maintain real-world data characteristics, so as to alleviate confidentiality concerns. Based on the data schema, a series of data generators are provided to support large-scale data generation, e.g., of user or product information. To cover a wide spectrum of data characteristics, we take into account diverse data types (structured, semi-structured, and unstructured) and different data sources (table, graph, text, image, audio, and video). Our framework integrates various open-source data storage systems and supports large-scale data generation and deployment.
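As an illustration of how such a schema-based generator might be used, here is a minimal Python sketch; the `generate_users` function, its field names, and its value distributions are hypothetical assumptions for illustration, not the framework's actual schema or API.

```python
# Hypothetical sketch of a schema-preserving, scale-configurable data
# generator; field names and distributions are illustrative assumptions.
import csv
import random
import string

def random_text(length: int) -> str:
    """Produce an anonymized text field of the given length."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

def generate_users(path: str, scale: int) -> None:
    """Write `scale` synthetic user records that mimic a real-world schema."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "age", "gender", "interest_tags"])
        for uid in range(scale):
            writer.writerow([
                uid,
                random.randint(18, 80),                      # plausible age range
                random.choice(["M", "F"]),
                ";".join(random_text(6) for _ in range(3)),  # anonymized tags
            ])

generate_users("users.csv", scale=100_000)  # data scale is a parameter
```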
To achieve diversity and representativeness, we first identify prominent AI problem domains that play important roles in the most important Internet service domains. We then provide concrete implementations of AI algorithms targeting those problem domains as component benchmarks. We also profile the most intensive units of computation across those component benchmarks and implement them as a set of micro benchmarks. Both micro and component benchmarks are implemented with composability in mind, so that each can run both collectively and individually.
The offline training and online inference modules are provided to construct an end-to-end application benchmark. First, the offline training module chooses one or more component benchmarks from the AI problem domain module by specifying the required benchmark ID, input data, and execution parameters such as batch size. The offline training module then trains a model and provides it to the online inference module, which loads the trained model onto a serving system, e.g., TensorFlow Serving. Collaborating with the other, non-AI modules on the critical paths, this forms an end-to-end application benchmark.
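This control flow can be pictured with a short Python sketch; the function names `run_component` and `serve_model` are hypothetical placeholders for the framework's actual entry points, used here only to show how the modules compose.

```python
# Hypothetical sketch of composing offline training and online inference;
# the function names are illustrative, not the framework's real API.

def run_component(benchmark_id: str, input_data: str, batch_size: int) -> str:
    """Offline training: run the chosen component benchmark with the given
    data and execution parameters, returning the trained model's path."""
    print(f"training {benchmark_id} on {input_data}, batch_size={batch_size}")
    return f"/models/{benchmark_id}"

def serve_model(model_path: str, serving_system: str = "TensorFlow Serving") -> None:
    """Online inference: load the trained model onto the serving system."""
    print(f"loading {model_path} onto {serving_system}")

# Offline training produces a model; online inference then serves it.
model_path = run_component("DC-AI-C1", input_data="ImageNet", batch_size=256)
serve_model(model_path)
```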
To ease deployment on a large-scale cluster, the framework provides deployment tools that contain two automated deployment templates, using Ansible and Kubernetes, respectively. The Ansible templates support scalable deployment on physical or virtual machines, while the Kubernetes templates are used to deploy on container clusters. A configuration file needs to be specified for installation and deployment, including module parameters (e.g., the chosen benchmark ID and input data) and cluster parameters (e.g., node, memory, and network information).
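To make the shape of such a configuration concrete, here is a minimal sketch expressed as a Python dictionary; the keys and values are assumptions for illustration, not AIBench's actual configuration format.

```python
# Illustrative sketch of the information a deployment configuration
# carries; keys and values are assumptions, not the actual file format.
config = {
    "benchmark": {
        "id": "DC-AI-C1",          # chosen benchmark ID
        "input_data": "imagenet",  # input data set
        "batch_size": 256,         # execution parameter
    },
    "cluster": {
        "nodes": ["node-1", "node-2", "node-3"],
        "memory_gb": 64,
        "network": "10GbE",
    },
    "deployment": "kubernetes",    # or "ansible" for physical/virtual machines
}
```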

Figure 1. AIBench Framework.

The Prominent AI Problem Domains

To cover a wide spectrum of prominent AI problem domains in Internet services, we thoroughly analyze the core scenarios of three primary Internet services: search engine, social network, and e-commerce, as shown in Table 1. In total, we identify sixteen representative AI problem domains.

Table 1. Prominent AI Problem Domains.

Micro, Component, and Application Benchmarks in AIBench

AIBench provides a scalable and comprehensive datacenter AI benchmark suite covering 16 problem domains. In total, AIBench consists of 12 micro benchmarks (Table 2), each of which implements a single unit of computation; 16 component benchmarks (Table 3), each of which combines different units of computation; and 2 end-to-end application benchmarks: DCMix, a datacenter application mix of AI and other workloads, and E-commerce Search, an end-to-end business AI benchmark. To cover a full spectrum of data characteristics, AIBench collects 16 representative data sets. The benchmarks are implemented not only on mainstream deep learning frameworks such as TensorFlow and PyTorch, but also on traditional programming models such as Pthreads, to enable apples-to-apples comparisons.
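To give a feel for what a micro benchmark measures, the following is a minimal TensorFlow sketch of timing a single unit of computation in the spirit of DC-AI-M1 (Convolution); the tensor shapes, iteration count, and timing scheme are illustrative choices, not AIBench's actual configuration.

```python
# Minimal sketch of a convolution micro benchmark in TensorFlow;
# shapes and iteration counts are illustrative, not AIBench's settings.
import time
import tensorflow as tf

# A CIFAR-like batch: 128 images of 32x32 pixels with 3 channels.
images = tf.random.normal([128, 32, 32, 3])
kernel = tf.random.normal([3, 3, 3, 64])  # 3x3 kernel, 3 -> 64 channels

@tf.function
def conv(x):
    return tf.nn.conv2d(x, kernel, strides=1, padding="SAME")

conv(images)  # warm-up run to exclude graph-building time

start = time.perf_counter()
for _ in range(100):
    conv(images)
elapsed = time.perf_counter() - start
print(f"average time per convolution: {elapsed / 100 * 1e3:.2f} ms")
```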

Table 2. Micro Benchmarks

No.         Micro Benchmark         Involved Data Motif   Data Set          Software Stack
DC-AI-M1    Convolution             Transform             Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M2    Fully Connected         Matrix                Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M3    ReLU                    Logic                 Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M4    Sigmoid                 Matrix                Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M5    Tanh                    Matrix                Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M6    MaxPooling              Sampling              Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M7    AvgPooling              Sampling              Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M8    CosineNorm              Basic Statistics      Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M9    BatchNorm               Basic Statistics      Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M10   Dropout                 Sampling              Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M11   Element-wise multiply   Matrix                Cifar, ImageNet   TensorFlow, Pthreads
DC-AI-M12   Softmax                 Matrix                Cifar, ImageNet   TensorFlow, Pthreads

Table 3. Component Benchmarks

No.         Component Benchmark        Algorithm                               Data Set                           Software Stack
DC-AI-C1    Image classification       ResNet50                                ImageNet                           TensorFlow, PyTorch
DC-AI-C2    Image generation           WassersteinGAN                          LSUN                               TensorFlow, PyTorch
DC-AI-C3    Text-to-Text translation   Recurrent neural networks               WMT English-German                 TensorFlow, PyTorch
DC-AI-C4    Image-to-Text              Neural Image Caption Model              Microsoft COCO                     TensorFlow, PyTorch
DC-AI-C5    Image-to-Image             CycleGAN                                Cityscapes                         TensorFlow, PyTorch
DC-AI-C6    Speech-to-Text             DeepSpeech2                             Librispeech                        TensorFlow, PyTorch
DC-AI-C7    Face embedding             FaceNet                                 LFW, VGGFace2                      TensorFlow, PyTorch
DC-AI-C8    3D Face Recognition        3D face models                          77,715 samples from 253 face IDs   TensorFlow, PyTorch
DC-AI-C9    Object detection           Faster R-CNN                            Microsoft COCO                     TensorFlow, PyTorch
DC-AI-C10   Recommendation             Collaborative filtering                 MovieLens                          TensorFlow, PyTorch
DC-AI-C11   Video prediction           Motion-focused predictive models        Robot pushing dataset              TensorFlow, PyTorch
DC-AI-C12   Image compression          Recurrent neural network                ImageNet                           TensorFlow, PyTorch
DC-AI-C13   3D object reconstruction   Convolutional encoder-decoder network   ShapeNet dataset                   TensorFlow, PyTorch
DC-AI-C14   Text summarization         Sequence-to-sequence model              Gigaword dataset                   TensorFlow, PyTorch
DC-AI-C15   Spatial transformer        Spatial transformer networks            MNIST                              TensorFlow, PyTorch
DC-AI-C16   Learning to rank           Ranking distillation                    Gowalla                            TensorFlow, PyTorch

Application Benchmark

DCMix
Modern datacenter computer systems are widely deployed with mixed workloads to improve system utilization and save cost. However, the throughput of latency-critical workloads is dominated by their worst-case performance, i.e., tail latency. To model this important application scenario, we propose an end-to-end application benchmark, DCMix, which generates mixed workloads whose latencies range from microseconds to minutes, with four mixed execution modes.
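The idea can be illustrated with a deliberately simplified Python sketch in which a latency-critical request loop co-runs with a batch job and the tail latency is measured; the workloads, their sizes, and the single mixed mode shown are illustrative assumptions and do not reflect DCMix's actual workloads or its four execution modes.

```python
# Highly simplified sketch of co-running a latency-critical workload
# with a batch job; workloads here are illustrative stand-ins only.
import threading
import time

latencies = []

def latency_critical(requests: int = 1000) -> None:
    """Issue short service requests and record each one's latency."""
    for _ in range(requests):
        start = time.perf_counter()
        sum(i * i for i in range(10_000))  # stand-in for request work
        latencies.append(time.perf_counter() - start)

def batch_job(stop: threading.Event) -> None:
    """A long-running offline job competing for the same CPU."""
    while not stop.is_set():
        sum(i * i for i in range(1_000_000))

stop = threading.Event()
background = threading.Thread(target=batch_job, args=(stop,))
background.start()
latency_critical()  # measure service latencies under interference
stop.set()
background.join()

latencies.sort()
print(f"p99 latency: {latencies[int(0.99 * len(latencies))] * 1e6:.0f} us")
```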

E-commerce Search
Modern Internet service workloads are notoriously complex, combining industry-scale architecture with machine learning algorithms. As a joint work with Alibaba, we release an end-to-end application benchmark, E-commerce Search, that mimics these complex modern Internet service workloads.

Metrics

AIBench focuses on a series of metrics covering accuracy, performance, and energy consumption, which are the major industry concerns. The metrics for online inference include query response latency, tail latency, and throughput on the performance side, as well as inference accuracy and inference energy consumption.

The metrics for offline training include the samples processed per second, the wall-clock time to train a specific number of epochs, the wall-clock time to train a model to a target accuracy, and the energy consumed to train a model to a target accuracy.
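As a worked illustration of how these training metrics relate, the following Python sketch derives the wall-clock time to a target accuracy and the samples-per-second throughput from a stubbed training loop; the epoch cost, accuracy trajectory, and target value are placeholder assumptions, not measurements from any AIBench workload.

```python
# Sketch of deriving offline-training metrics from a training loop;
# the epoch cost and accuracy gains below are placeholder assumptions.
import time

TARGET_ACCURACY = 0.75
accuracy, samples, epochs = 0.60, 0, 0
start = time.perf_counter()

# Train until the model reaches the target accuracy.
while accuracy < TARGET_ACCURACY:
    time.sleep(0.1)     # placeholder for one epoch of real training work
    samples += 50_000   # samples processed this epoch (illustrative)
    epochs += 1
    accuracy += 0.05    # placeholder for validation-accuracy improvement

elapsed = time.perf_counter() - start
print(f"wall-clock time to {TARGET_ACCURACY:.0%} accuracy: {elapsed:.2f} s "
      f"over {epochs} epochs")
print(f"throughput: {samples / elapsed:,.0f} samples/s")
```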

Data Model

To cover a diversity of data sets from various applications, we collect 16 representative data sets, including ImageNet, Cifar, LSUN, WMT English-German, Cityscapes, LibriSpeech, Microsoft COCO data set, LFW, VGGFace2, Robot pushing data set, MovieLens data set, ShapeNet data set, Gigaword data set, MNIST data set, Gowalla data set, and the 3D face recognition data set from our industry partner.

Contributors

Prof. Jianfeng Zhan, ICT, Chinese Academy of Sciences, and BenchCouncil    
Dr. Wanling Gao, ICT, Chinese Academy of Sciences    
Fei Tang, ICT, Chinese Academy of Sciences    
Dr. Lei Wang, ICT, Chinese Academy of Sciences    
Chuanxin Lan, ICT, Chinese Academy of Sciences    
Chunjie Luo, ICT, Chinese Academy of Sciences
Yunyou Huang, ICT, Chinese Academy of Sciences
Dr. Chen Zheng, ICT, Chinese Academy of Sciences, and BenchCouncil    
Dr. Zheng Cao, Alibaba     
Hainan Ye, Beijing Academy of Frontier Sciences and BenchCouncil     
Jiahui Dai, Beijing Academy of Frontier Sciences and BenchCouncil     
Daoyi Zheng, Baidu     
Haoning Tang, Tencent     
Kunlin Zhan, 58.com     
Biao Wang, NetEase     
Defei Kong, ByteDance     
Tong Wu, China National Institute of Metrology     
Minghe Yu, Zhihu     
Chongkang Tan, Lenovo     
Huan Li, PayPal     
Dr. Xinhui Tian, Moqi     
Yatao Li, Microsoft Research Asia     
Dr. Gang Lu, Huawei     
Junchao Shao, JD.com     
Zhenyu Wang, CloudTa     
Xiaoyu Wang, Intellifusion     

Numbers

Benchmarking results will be available soon.

License

AIBench is available to researchers interested in AI. AIBench itself is open source under the Apache License, Version 2.0; please use all files in compliance with the License. Its software components are all available as open-source software governed by their own licensing terms, and researchers intending to use AIBench are required to fully understand and abide by the licensing terms of the various components. For software developed externally (i.e., not by the AIBench group):

  • Redistribution of source code must comply with the license and notice disclaimers
  • Redistribution in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimers in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.