AIBench | Scalable and Comprehensive AI Benchmarking for Datacenter, HPC, IoT and Edge, BenchCouncil

News: Bench19 Call for Papers (Denver, US, prior to SC 2019). Data Motif published at PACT'18. Big Data and AI Proxy Benchmarks published at IISWC'18. Technical Report on BOPS: a new metric for Datacenter Computing.

Summary

BenchCouncil publishes two scalable and comprehensive benchmarking suites: BigDataBench for big data benchmarking and AIBench for AI benchmarking. Both are developed jointly with ICT, CAS, Beijing Academy of Frontier Sciences and Technology, Alibaba, National Supercomputing Center, Tencent, Baidu, Wuba, Yahoo!, DropBox, Tsinghua University, Capital University of Medical Sciences, Princeton University, etc. In total, AIBench consists of four AI benchmarking projects:

DC AIBench for datacenter AI benchmarking;
HPC AI500 for benchmarking HPC AI systems;
AIoT Bench for benchmarking mobile and embedded device intelligence;
Edge AIBench for end-to-end edge computing benchmarking.

Benchmark Methodology

We specify the common requirements of Big Data and AI only algorithmically, in a paper-and-pencil approach that is reasonably divorced from individual implementations. We capture the differences and collaborations among IoT, edge, datacenter and HPC in handling Big Data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on initial or intermediate data inputs, each of which we call a data motif. For the first time, among a wide variety of big data and AI workloads, we identify eight data motifs (PACT'18 paper): Matrix, Sampling, Logic, Transform, Set, Graph, Sort and Statistic computation, each of which captures the common requirements of one class of unit of computation. Rather than creating a new benchmark or proxy for every possible workload, we propose using data motif-based benchmarks, that is, combinations of the eight data motifs, to represent the diversity of big data and AI workloads. Figure 1 summarizes our data motif-based scalable benchmarking methodology.

Figure 1 BigDataBench Benchmarking Methodology.
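
As a minimal sketch of the idea that a workload is a pipeline of data motifs, the Python example below chains three toy functions standing in for the Sampling, Sort and Statistic motifs. The function names and the sample workload are hypothetical and chosen only for illustration; they are not part of BigDataBench or AIBench.

```python
from statistics import mean


# Hypothetical stand-ins for three of the eight data motifs.
def sampling_motif(data, step=2):
    """Sampling: keep every `step`-th element of the input."""
    return data[::step]


def sort_motif(data):
    """Sort: return the elements in ascending order."""
    return sorted(data)


def statistic_motif(data):
    """Statistic: reduce the data to a few summary values."""
    return {"min": min(data), "max": max(data), "mean": mean(data)}


def run_pipeline(motifs, data):
    """Feed the output of each motif into the next one."""
    for motif in motifs:
        data = motif(data)
    return data


# A toy "workload" expressed as a combination of motifs.
workload = [sampling_motif, sort_motif, statistic_motif]
result = run_pipeline(workload, [9, 4, 7, 1, 8, 3])
print(result)
```

Swapping, repeating or reordering the entries in `workload` gives a different pipeline without any new benchmark code, which is the property the methodology relies on.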

(1) DC AIBench: Datacenter AI Benchmarks

DC AIBench provides a scalable and comprehensive datacenter AI benchmark suite covering 15 problem domains: image classification, image generation, text-to-text translation, image-to-text, image-to-image, speech-to-text, face embedding, 3D face recognition, object detection, video prediction, image compression, recommendation, 3D object reconstruction, text summarization, and spatial transformer. In total, DC AIBench consists of 10 micro benchmarks (as shown in Table 1), each of which implements a single unit of computation; 15 component benchmarks (as shown in Table 2), each of which combines different units of computation; and 2 end-to-end application benchmarks: DCMix, a datacenter application combination mixed with AI workloads, and E-commerce AI, an end-to-end business AI benchmark. To cover a full spectrum of data characteristics, DC AIBench collects 15 representative data sets for datacenter AI benchmarks. The benchmarks are implemented not only on mainstream deep learning frameworks such as TensorFlow and PyTorch, but also on traditional programming models such as Pthreads, to allow an apples-to-apples comparison.

Micro Benchmark

No.	Micro Benchmark	Involved Data Motif	Data Set	Software Stack
DC-AI-M1	Convolution	Transform	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M2	Fully Connected	Matrix	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M3	Relu	Logic	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M4	Sigmoid	Matrix	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M5	Tanh	Matrix	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M6	MaxPooling	Sampling	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M7	AvgPooling	Sampling	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M8	CosineNorm	Basic Statistics	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M9	BatchNorm	Basic Statistics	Cifar, ImageNet	TensorFlow, Pthreads
DC-AI-M10	Dropout	Sampling	Cifar, ImageNet	TensorFlow, Pthreads
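
To illustrate what a micro benchmark of a single unit of computation can look like, here is a minimal timing harness in plain Python for three of the operators listed above (Relu, Sigmoid, Tanh). This is only a sketch under stated assumptions: the real DC AIBench micro benchmarks run on TensorFlow and Pthreads, and the helper `time_operator` and the synthetic input are hypothetical.

```python
import math
import time


def relu(xs):
    """ReLU: clamp negative values to zero."""
    return [x if x > 0.0 else 0.0 for x in xs]


def sigmoid(xs):
    """Sigmoid: squash each value into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]


def tanh(xs):
    """Tanh: squash each value into the range (-1, 1)."""
    return [math.tanh(x) for x in xs]


def time_operator(op, xs, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        op(xs)
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic input covering negative and positive values.
inputs = [(i % 200 - 100) / 10.0 for i in range(10_000)]
for op in (relu, sigmoid, tanh):
    print(f"{op.__name__}: {time_operator(op, inputs) * 1e3:.3f} ms")
```

Taking the minimum over several repetitions is one common way to reduce noise from other processes; a real harness would also control input size, data set and warm-up.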

Component Benchmark

No.	Component Benchmark	Algorithm	Data Set	Software Stack
DC-AI-C1	Image classification	ResNet50	ImageNet	TensorFlow, PyTorch
DC-AI-C2	Image generation	WassersteinGAN	LSUN	TensorFlow, PyTorch
DC-AI-C3	Text-to-Text translation	Recurrent neural networks	WMT English-German	TensorFlow, PyTorch
DC-AI-C4	Image-to-Text	Neural Image Caption Model	Microsoft COCO	TensorFlow, PyTorch
DC-AI-C5	Image-to-Image	CycleGAN	Cityscapes	TensorFlow, PyTorch
DC-AI-C6	Speech-to-Text	DeepSpeech2	Librispeech	TensorFlow, PyTorch
DC-AI-C7	Face embedding	Facenet	LFW, VGGFace2	TensorFlow, PyTorch
DC-AI-C8	3D Face Recognition	3D face models	77,715 samples from 253 face IDs	TensorFlow, PyTorch
DC-AI-C9	Object detection	Faster R-CNN	Microsoft COCO	TensorFlow, PyTorch
DC-AI-C10	Recommendation	Collaborative filtering	MovieLens	TensorFlow, PyTorch
DC-AI-C11	Video prediction	Motion-Focused predictive models	Robot pushing dataset	TensorFlow, PyTorch
DC-AI-C12	Image compression	Recurrent neural network	ImageNet	TensorFlow, PyTorch
DC-AI-C13	3D object reconstruction	Convolutional encoder-decoder network	ShapeNet Dataset	TensorFlow, PyTorch
DC-AI-C14	Text summarization	Sequence-to-sequence model	Gigaword dataset	TensorFlow, PyTorch
DC-AI-C15	Spatial transformer	Spatial transformer networks	MNIST	TensorFlow, PyTorch

Application Benchmark

DCMix
Modern datacenter computer systems are widely deployed with mixed workloads to improve system utilization and save cost. However, the throughput of latency-critical workloads is dominated by their worst-case performance, that is, their tail latency. To model this important application scenario, we propose DCMix, an end-to-end application benchmark that generates mixed workloads whose latencies range from microseconds to minutes, with four mixed execution modes.
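
To make the notion of tail latency concrete, the sketch below computes a nearest-rank percentile over a list of request latencies. The latency values are invented for illustration and are not DCMix measurements; the `percentile` helper is likewise hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile; `p` is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up latencies in milliseconds: most requests are fast, two are slow.
latencies_ms = [1.0] * 98 + [50.0, 200.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

In this toy data set the median stays at the fast-path latency while the 99th percentile is dominated by the slow requests, which is why latency-critical services are usually judged by high percentiles rather than by the mean.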

E-commerce search
Modern Internet service workloads are notoriously complex, combining industry-scale architectures with machine learning algorithms. In joint work with Alibaba, we release E-commerce Search, an end-to-end application benchmark that mimics complex modern Internet service workloads.

(2) AIoT Bench: Benchmarking for Mobile and Embedded Device Intelligence

Owing to increasing amounts of data and compute resources, deep learning has achieved many successes in various domains. Recently, researchers and engineers have made efforts to apply these intelligent algorithms to mobile and embedded devices, e.g. smartphones, self-driving cars and smart homes. On one hand, neural networks are made more lightweight to fit mobile or embedded devices by using simpler architectures, or by quantizing, pruning and compressing the networks. On the other hand, mobile and embedded devices provide additional hardware acceleration using GPUs or NPUs to support AI applications. Since AI applications on mobile and embedded devices are getting more and more attention, benchmarking the AI ability of those devices has become an urgent problem.
AIoT Bench is a comprehensive benchmark suite to evaluate the AI ability of mobile and embedded devices. Our benchmark 1) covers different application domains, e.g. image recognition, speech recognition and natural language processing; 2) covers different platforms, including Android devices and Raspberry Pi; 3) covers different development tools, including TensorFlow and Caffe2; 4) offers both end-to-end application workloads and micro workloads.

The workloads in AIoT Bench are implemented using both TensorFlow Lite and Caffe2 on Android as well as Raspberry Pi. Only the prediction procedure is included, since training is usually carried out in datacenters.
Image classification workload. This is an end-to-end application workload from the vision domain, which takes an image as input and outputs the image label. The model we use for image classification is MobileNet, a lightweight convolutional network designed for mobile and embedded devices.
Speech recognition workload. This is an end-to-end application workload from the speech domain, which takes words and phrases in a spoken language as input and converts them to text. The model we use is DeepSpeech 2, which consists of 2 convolutional layers, 5 bidirectional RNN layers, and a fully connected layer.
Transformer translation workload. This is an end-to-end application workload from the NLP domain, which takes text in one language as input and translates it into another language. The model we use is the transformer translation model, which solves sequence-to-sequence problems using attention mechanisms, without the recurrent connections used in traditional neural seq2seq models.
Micro workloads. We also provide micro workloads, which are the basic operations from which different networks are composed. In detail, the micro workloads include convolution, pointwise convolution, depthwise convolution, matrix multiply, pointwise add, ReLU activation, sigmoid activation, max pooling and average pooling.
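
The micro workloads above include both a standard convolution and the depthwise and pointwise convolutions that lightweight networks such as MobileNet combine into a depthwise separable convolution. As a rough worked example, the sketch below counts multiply-accumulate operations (MACs) for both forms. It assumes stride 1 and an output feature map of the same spatial size as the input; the layer dimensions are hypothetical.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 14 x 14 feature map, 256 -> 256 channels, 3 x 3 kernel.
h, w, c_in, c_out, k = 14, 14, 256, 256, 3
standard = standard_conv_macs(h, w, c_in, c_out, k)
separable = depthwise_separable_macs(h, w, c_in, c_out, k)
print("standard: ", standard)
print("separable:", separable)
print("ratio:    ", separable / standard)
```

The ratio equals 1/c_out + 1/k^2, so for a 3 x 3 kernel and many output channels the separable form needs roughly one ninth of the MACs. This is one reason to benchmark the depthwise and pointwise operators separately on mobile and embedded devices.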

(3) HPC AI500: A Benchmark Suite for HPC AI Systems

In recent years, with the trend of applying deep learning (DL) in scientific computing, physical simulation is no longer the only class of problems to be solved in the HPC community. The unique characteristics of emerging scientific DL workloads raise great challenges in benchmarking, and thus the community needs a new yardstick for evaluating future HPC systems.
Consequently, we propose HPC AI500, a benchmark suite for evaluating HPC systems that run scientific DL workloads. Each workload in HPC AI500 is based on a real scientific DL application, and together they cover the most representative scientific fields. Currently, we choose 14 scientific DL benchmarks based on application scenarios, datasets, and software stack. Furthermore, we propose a set of metrics for comprehensively evaluating HPC systems, considering accuracy and performance as well as power and cost. In addition, we provide a scalable reference implementation of HPC AI500.

Component Benchmarks. Since object detection, image recognition, and image generation are the most representative tasks in modern scientific DL, we choose the following state-of-the-art models as the HPC AI500 component benchmarks.
1) Faster-RCNN. This benchmark targets real-time object detection. Unlike previous object detection models, it replaces selective search with a region proposal network that achieves nearly cost-free region proposals. Furthermore, Faster-RCNN uses an advanced CNN model as its base network for extracting features, and it is the foundation of the 1st-place winning entries in ILSVRC’15 (ImageNet Large Scale Visual Recognition Competition).
2) ResNet. This benchmark is a milestone in image recognition, marking the point at which AI surpassed humans in identifying images. It addresses the degradation problem, in which the gradient gradually vanishes during propagation through a very deep neural network, leading to poor performance. Building on the idea of ResNet, researchers successfully built a 152-layer deep CNN. This ultra-deep model won all the awards in ILSVRC’15.
3) DCGAN. This benchmark is one of the popular and successful neural network architectures for GANs. Its fundamental idea is to replace fully connected layers with convolutions and to use transposed convolution for upsampling. DCGAN helps bridge the gap between CNNs for supervised learning and unsupervised learning.
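
Two of the ideas named in the list above can be sketched in a few lines of plain Python: the shortcut connection at the core of ResNet, which adds a block's input to its output, and the transposed convolution that DCGAN uses for upsampling. Both functions are simplified one-dimensional illustrations with hypothetical names, not the HPC AI500 reference implementation. The kernel size 4, stride 2 and padding 1 are a common choice for doubling the resolution and are used here as an assumption.

```python
def residual_block(x, f):
    """Return f(x) + x element-wise, i.e. a block with an identity shortcut."""
    return [fx + xi for fx, xi in zip(f(x), x)]


def transposed_conv1d(x, kernel, stride=2, padding=1):
    """1D transposed convolution: scatter each input value through the kernel."""
    k = len(kernel)
    full = [0.0] * ((len(x) - 1) * stride + k)
    for i, xi in enumerate(x):
        for j, wj in enumerate(kernel):
            full[i * stride + j] += xi * wj
    # Crop `padding` elements from both ends of the full output.
    return full[padding:len(full) - padding]


# With an all-zero inner function the block reduces to the identity mapping.
print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))

# Kernel size 4, stride 2, padding 1 doubles the length: 3 inputs -> 6 outputs.
upsampled = transposed_conv1d([1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0])
print(len(upsampled), upsampled)
```

The output length follows (n - 1) * stride + k - 2 * padding, which gives 2n for the parameters used here.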

Micro Benchmarks. We choose the following primary operators in CNNs as HPC AI500 micro benchmarks.
1) Convolution. In mathematics, convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other. In a CNN, convolution is the operation that accounts for the largest share of the computation: it multiplies and accumulates the input matrix with the convolution kernel to produce feature maps. Many convolution kernels are distributed across different layers and are responsible for learning features at different levels.
2) Fully connected. The fully connected layer can be seen as the classifier of a CNN and is essentially a matrix multiplication. It is also the cause of the explosion of CNN parameters. For example, in AlexNet, the number of training parameters in the fully connected layers reaches about 59 million and accounts for 94 percent of the total.
3) Pooling. Pooling is a sample-based discretization process. In a CNN, the objective of pooling is to down-sample the inputs (e.g., feature maps), which reduces dimensionality and the number of training parameters. In addition, it enhances the robustness of the whole network. The commonly used pooling operations include max-pooling and average-pooling.
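
The three operators above can be illustrated with a small pure-Python sketch. It is not the HPC AI500 implementation: `conv2d` slides the kernel without flipping it, as CNN frameworks commonly do, and uses no padding; `max_pool2d` assumes a square window whose stride equals its size; and the AlexNet layer sizes (9216 to 4096 to 4096 to 1000) are the commonly cited ones, stated here as an assumption.

```python
def conv2d(image, kernel):
    """Multiply-accumulate a kernel over every valid position of a 2D input."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool2d(fmap, size=2):
    """Down-sample by keeping the maximum of each size x size window."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


feature_map = conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
pooled = max_pool2d([[1, 2, 3, 4], [5, 6, 7, 8],
                     [9, 10, 11, 12], [13, 14, 15, 16]])
alexnet_fc = fc_params(9216, 4096) + fc_params(4096, 4096) + fc_params(4096, 1000)
print(feature_map)
print(pooled)
print(alexnet_fc)
```

Under the assumed layer sizes, the fully connected parameter count comes to 58,631,144, in line with the figure of about 59 million quoted above.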

(4) Edge AIBench: Comprehensive End-to-end Edge Computing Benchmarking

The rapid growth in the number of client-side devices brings great challenges to computing power, data storage, and networks, considering the variety and quantity of these devices. The traditional way to solve these problems is the cloud computing framework. With the development of 5G network technology, the Internet of Things (IoT) has become another solution. Moreover, edge computing, which combines the cloud computing and IoT frameworks, has developed rapidly in recent years. In edge computing scenarios, the distribution of data and the collaboration of workloads across different layers are serious concerns for performance, security, and privacy. For edge computing benchmarking, we must therefore take an end-to-end view that considers all three layers: client-side devices, the edge computing layer, and cloud servers.
Edge AIBench is a benchmark suite for end-to-end edge computing that includes four typical application scenarios: ICU Patient Monitor, Surveillance Camera, Smart Home, and Autonomous Vehicle, which together reflect the complexity of edge computing AI scenarios. In addition, Edge AIBench provides an end-to-end application benchmarking framework that covers training, validation, inference, data collection and other parts, using a general three-layer edge computing framework. Table 3 shows the component benchmarks of Edge AIBench.

ICU Patient Monitor. The ICU is where critical patients are treated. Immediacy is therefore essential in the ICU patient monitor scenario, so that doctors are notified of a patient's status as soon as possible. The dataset we use is MIMIC-III, which provides many kinds of patient data such as vital signs and fluid balance. We choose heart failure prediction and endpoint prediction as the AI benchmarks.
Surveillance Camera. There are many surveillance cameras all over the world, and these cameras continuously produce a large quantity of video data. If we transmitted all of the data to cloud servers, the required network bandwidth would be very high. Therefore, this scenario focuses on data preprocessing and data compression at the edge.
Smart Home. A smart home includes many smart devices such as automatic controllers, alarm systems and audio equipment. The smart home scenario is thus characterized by different kinds of edge devices and heterogeneous data. We choose two AI applications as the component benchmarks: speech recognition and face recognition. These two components have heterogeneous data and different collecting devices. Both collect data on the client-side devices (e.g., camera and smartphone), infer on the edge computing layer and train on the cloud server.
Autonomous Vehicle. The autonomous vehicle scenario is distinguished by its high demand for validity: the vehicle must take the correct action even without human intervention. This feature represents the demand of some edge computing AI scenarios. The automatic control system analyzes the current road conditions and reacts at once. We choose road sign recognition as the component benchmark.

A Federated Learning Framework Testbed. We have developed an edge computing AI testbed to support researchers and common users, which is publicly available from /testbed.html. Security and privacy have become significant concerns in the age of big data, and in edge computing as well. Federated learning is a distributed collaborative machine learning technology whose main goal is to preserve privacy. Our testbed system will incorporate a federated learning framework.
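
To illustrate the kind of aggregation a federated learning framework performs, the sketch below implements weighted federated averaging (often called FedAvg) over plain Python lists. The text above does not say which algorithm the testbed uses, so this choice is an assumption made for illustration, and the function name and client data are hypothetical.

```python
def federated_average(client_updates):
    """Average model weights, weighting each client by its sample count.

    `client_updates` is a list of (weights, num_samples) pairs, where
    `weights` is a flat list of floats of the same length for every client.
    """
    total = sum(n for _, n in client_updates)
    averaged = [0.0] * len(client_updates[0][0])
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged


# Two hypothetical clients: only weights and sample counts leave the device,
# never the raw training data.
updates = [
    ([1.0, 2.0], 10),   # client A trained on 10 samples
    ([3.0, 4.0], 30),   # client B trained on 30 samples
]
global_weights = federated_average(updates)
print(global_weights)
```

Because the server only sees model weights and sample counts, the raw data stays on the client-side devices, which is the privacy property the paragraph above refers to.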

Table 3. The summary of Edge AIBench

End-to-end Application Scenarios	Component Benchmarks	Cloud Server	Edge Computing Layer	Client Side Device
ICU Patient Monitor	Heart Failure Prediction	Train	Infer, Alarm	Generate Data
ICU Patient Monitor	Endpoint Prediction	Train	Infer	Generate Data
Surveillance Camera	Person Re-Identification	Decompress Data, Train	Compress Data, Infer	Generate Data
Smart Home	Speech Recognition	Train	Infer	Generate Data
Smart Home	Face Recognition	Train	Infer	Generate Data
Autonomous Vehicle	Road Sign Recognition	Train	Infer	Generate Data
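
The placement of stages across the three layers in Table 3 can also be written down as a small data structure, which makes it easy to query which benchmarks run a given stage on a given layer. This is only a sketch: the dictionary layout and the helper `benchmarks_with` are hypothetical, and it assumes that cells such as "Infer Alarm" list two separate stages.

```python
# Table 3 as a mapping: component benchmark -> scenario and stages per layer.
EDGE_AIBENCH = {
    "Heart Failure Prediction": {
        "scenario": "ICU Patient Monitor",
        "cloud": ["Train"], "edge": ["Infer", "Alarm"], "client": ["Generate Data"]},
    "Endpoint Prediction": {
        "scenario": "ICU Patient Monitor",
        "cloud": ["Train"], "edge": ["Infer"], "client": ["Generate Data"]},
    "Person Re-Identification": {
        "scenario": "Surveillance Camera",
        "cloud": ["Decompress Data", "Train"],
        "edge": ["Compress Data", "Infer"], "client": ["Generate Data"]},
    "Speech Recognition": {
        "scenario": "Smart Home",
        "cloud": ["Train"], "edge": ["Infer"], "client": ["Generate Data"]},
    "Face Recognition": {
        "scenario": "Smart Home",
        "cloud": ["Train"], "edge": ["Infer"], "client": ["Generate Data"]},
    "Road Sign Recognition": {
        "scenario": "Autonomous Vehicle",
        "cloud": ["Train"], "edge": ["Infer"], "client": ["Generate Data"]},
}


def benchmarks_with(layer, stage):
    """Names of component benchmarks that run `stage` on `layer`."""
    return sorted(name for name, row in EDGE_AIBENCH.items()
                  if stage in row[layer])


print(benchmarks_with("edge", "Compress Data"))
print(benchmarks_with("cloud", "Train"))
```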

Contributors

Prof. Jianfeng Zhan, ICT, Chinese Academy of Sciences, and BenchCouncil
Dr. Wanling Gao, ICT, Chinese Academy of Sciences
Dr. Lei Wang, ICT, Chinese Academy of Sciences
Chunjie Luo, ICT, Chinese Academy of Sciences
Tianshu Hao, ICT, Chinese Academy of Sciences
Zihan Jiang, ICT, Chinese Academy of Sciences
Yunyou Huang, ICT, Chinese Academy of Sciences
Dr. Chen Zheng, ICT, Chinese Academy of Sciences, and BenchCouncil
Dr. Zheng Cao, Alibaba
Hainan Ye, Beijing Academy of Frontier Sciences and BenchCouncil
Dr. Zhen Jia, Princeton University and BenchCouncil
Daoyi Zheng, Baidu
Shujie Zhang, Huawei
Haoning Tang, Tencent
Dr. Yingjie Shi
Zijian Ming, Tencent
Yuanqing Guo, Sohu
Yongqiang He, Dropbox
Kent Zhan, Tencent (Previously), WUBA(Currently)
Xiaona Li, Baidu
Bizhu Qiu, Yahoo!
Qiang Yang, BAFST
Jingwei Li, BAFST
Dr. Xinhui Tian, ICT, CAS
Dr. Gang Lu, BAFST
Xinlong Lin, BAFST
Rui Ren, ICT, CAS
Dr. Rui Han, ICT, CAS

Numbers

Benchmarking results will be available soon.

License

AIBench is available for researchers interested in AI. AIBench is open-source under the Apache License, Version 2.0. Please use all files in compliance with the License. Software components of AIBench are all available as open-source software and governed by their own licensing terms. Researchers intending to use AIBench are required to fully understand and abide by the licensing terms of the various components.

Software developed externally (not by the AIBench group):

TensorFlow: https://www.tensorflow.org
PyTorch: https://pytorch.org/
Caffe2: http://caffe2.ai

Redistribution of source code must comply with the license and notice disclaimers.
Redistribution in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimers in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.