Home

News: Bench19 Call for Papers (Denver, US, prior to SC 2019). The data motif paper was published at PACT'18. The big data and AI proxy benchmarks paper was published at IISWC'18. Technical report on BOPS, a new metric for datacenter computing.

Summary

BenchCouncil publishes two scalable and comprehensive benchmarking suites: BigDataBench for big data benchmarking and AIBench for AI benchmarking, jointly with ICT, CAS, Beijing Academy of Frontier Sciences and Technology, Alibaba, National Supercomputing Center, Tencent, Baidu, Wuba, Yahoo!, Dropbox, Tsinghua University, Capital University of Medical Sciences, Princeton University, and others. AIBench comprises four AI benchmarking projects:

  • DC AIBench for datacenter AI benchmarking;
  • HPC AI500 for benchmarking HPC AI systems;
  • AIoT Bench for benchmarking mobile and embedded device intelligence;
  • Edge AIBench for end-to-end edge computing benchmarking.

Benchmark Methodology

We specify the common requirements of big data and AI only algorithmically, in a paper-and-pencil approach, reasonably divorced from individual implementations. We capture the differences and collaborations among IoT, edge, datacenter, and HPC in handling big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on initial or intermediate data inputs, each of which we call a data motif. For the first time, across a wide variety of big data and AI workloads, we identify eight data motifs (PACT'18 paper): Matrix, Sampling, Logic, Transform, Set, Graph, Sort, and Statistic computation, each of which captures the common requirements of one class of unit of computation. Rather than creating a new benchmark or proxy for every possible workload, we propose using data motif-based benchmarks (combinations of the eight data motifs) to represent the diversity of big data and AI workloads. Figure 1 summarizes our data motif-based scalable benchmarking methodology.

Figure 1. The BigDataBench benchmarking methodology.
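
To illustrate the motif-based view, the following sketch composes a toy workload from motif-level units of computation; the function names are illustrative only and are not part of the BigDataBench API.

```python
# A toy workload expressed as a pipeline of data motifs (Sampling,
# Transform, Statistics). Function names are illustrative, not the
# BigDataBench API.
import numpy as np

def sampling(x, rate=0.5):
    # Sampling motif: keep a random subset of the input.
    return x[np.random.rand(len(x)) < rate]

def transform(x):
    # Transform motif: here, a fast Fourier transform.
    return np.fft.fft(x)

def statistics(x):
    # Statistics motif: summary statistics of the magnitudes.
    mag = np.abs(x)
    return {"mean": mag.mean(), "std": mag.std()}

data = np.random.rand(10_000)          # initial data input
print(statistics(transform(sampling(data))))
```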

(1) DC AIBench: Datacenter AI Benchmarks

The datacenter AI benchmark suite, DC AIBench, is scalable and comprehensive, covering 15 problem domains: image classification, image generation, text-to-text translation, image-to-text, image-to-image, speech-to-text, face embedding, 3D face recognition, object detection, video prediction, image compression, recommendation, 3D object reconstruction, text summarization, and spatial transformer. In total, DC AIBench consists of 10 micro benchmarks (as shown in Table 1), each of which implements a single unit of computation; 15 component benchmarks (as shown in Table 2), each of which combines different units of computation; and 2 end-to-end application benchmarks: DCMix, a datacenter application benchmark that mixes AI workloads with other workloads, and E-commerce Search, an end-to-end business AI benchmark. To cover a full spectrum of data characteristics, DC AIBench collects 15 representative data sets. The benchmarks are implemented not only on mainstream deep learning frameworks such as TensorFlow and PyTorch, but also on traditional programming models such as Pthreads, to enable apples-to-apples comparisons.

Micro Benchmark

Table 1. The micro benchmarks of DC AIBench.

| No.       | Micro Benchmark | Involved Data Motif | Data Set        | Software Stack       |
|-----------|-----------------|---------------------|-----------------|----------------------|
| DC-AI-M1  | Convolution     | Transform           | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M2  | Fully Connected | Matrix              | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M3  | Relu            | Logic               | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M4  | Sigmoid         | Matrix              | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M5  | Tanh            | Matrix              | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M6  | MaxPooling      | Sampling            | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M7  | AvgPooling      | Sampling            | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M8  | CosineNorm      | Basic Statistics    | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M9  | BatchNorm       | Basic Statistics    | Cifar, ImageNet | TensorFlow, Pthreads |
| DC-AI-M10 | Dropout         | Sampling            | Cifar, ImageNet | TensorFlow, Pthreads |
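
As an illustration of a micro benchmark, the sketch below times a single convolution (cf. DC-AI-M1) in TensorFlow 2 on random ImageNet-shaped data; the shapes and timing loop are ours, not the DC AIBench reference implementation.

```python
# Illustrative convolution micro benchmark (cf. DC-AI-M1); random data
# stands in for Cifar/ImageNet, and this is not the reference harness.
import time
import tensorflow as tf

images = tf.random.normal([32, 224, 224, 3])   # ImageNet-shaped batch
kernel = tf.random.normal([3, 3, 3, 64])       # 3x3 kernel, 64 channels out

def conv_step():
    return tf.nn.conv2d(images, kernel, strides=1, padding="SAME")

conv_step()                                    # warm-up
start = time.perf_counter()
for _ in range(100):
    conv_step()
elapsed = time.perf_counter() - start
print(f"average convolution latency: {elapsed / 100 * 1e3:.2f} ms")
```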

Component Benchmark

Table 2. The component benchmarks of DC AIBench.

| No.       | Component Benchmark      | Algorithm                             | Data Set                         | Software Stack      |
|-----------|--------------------------|---------------------------------------|----------------------------------|---------------------|
| DC-AI-C1  | Image classification     | ResNet50                              | ImageNet                         | TensorFlow, PyTorch |
| DC-AI-C2  | Image generation         | WassersteinGAN                        | LSUN                             | TensorFlow, PyTorch |
| DC-AI-C3  | Text-to-Text translation | Recurrent neural networks             | WMT English-German               | TensorFlow, PyTorch |
| DC-AI-C4  | Image-to-Text            | Neural Image Caption Model            | Microsoft COCO                   | TensorFlow, PyTorch |
| DC-AI-C5  | Image-to-Image           | CycleGAN                              | Cityscapes                       | TensorFlow, PyTorch |
| DC-AI-C6  | Speech-to-Text           | DeepSpeech2                           | Librispeech                      | TensorFlow, PyTorch |
| DC-AI-C7  | Face embedding           | Facenet                               | LFW, VGGFace2                    | TensorFlow, PyTorch |
| DC-AI-C8  | 3D Face Recognition      | 3D face models                        | 77,715 samples from 253 face IDs | TensorFlow, PyTorch |
| DC-AI-C9  | Object detection         | Faster R-CNN                          | Microsoft COCO                   | TensorFlow, PyTorch |
| DC-AI-C10 | Recommendation           | Collaborative filtering               | MovieLens                        | TensorFlow, PyTorch |
| DC-AI-C11 | Video prediction         | Motion-focused predictive models      | Robot pushing dataset            | TensorFlow, PyTorch |
| DC-AI-C12 | Image compression        | Recurrent neural network              | ImageNet                         | TensorFlow, PyTorch |
| DC-AI-C13 | 3D object reconstruction | Convolutional encoder-decoder network | ShapeNet dataset                 | TensorFlow, PyTorch |
| DC-AI-C14 | Text summarization       | Sequence-to-sequence model            | Gigaword dataset                 | TensorFlow, PyTorch |
| DC-AI-C15 | Spatial transformer      | Spatial transformer networks          | MNIST                            | TensorFlow, PyTorch |
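
As an illustration of running a component benchmark, the sketch below measures ResNet50 (cf. DC-AI-C1) inference throughput in PyTorch on a random batch; data loading, training, and accuracy checks are omitted, and this is not the reference harness.

```python
# Illustrative inference run of the image classification component
# benchmark (cf. DC-AI-C1, ResNet50); not the DC AIBench reference code.
import time
import torch
import torchvision.models as models

model = models.resnet50().eval()        # randomly initialized weights
batch = torch.randn(32, 3, 224, 224)    # ImageNet-shaped random batch

with torch.no_grad():
    model(batch)                        # warm-up
    start = time.perf_counter()
    for _ in range(10):
        model(batch)
    elapsed = time.perf_counter() - start

print(f"throughput: {32 * 10 / elapsed:.1f} images/s")
```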

Application Benchmark

DCMix
Modern datacenter computer systems widely deploy mixed workloads to improve system utilization and reduce cost. However, the throughput of latency-critical workloads is dominated by their worst-case performance, i.e., tail latency. To model this important application scenario, we propose an end-to-end application benchmark, DCMix, which generates mixed workloads whose latencies range from microseconds to minutes, with four mixed execution modes.
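
To see why tail latency, not the mean, governs latency-critical workloads, the sketch below computes percentile metrics over a simulated latency distribution; the distribution is made up for illustration.

```python
# Tail latency is a high percentile of per-request latency; a small
# fraction of slow requests dominates it. The distribution is simulated.
import numpy as np

rng = np.random.default_rng(0)
latencies = np.concatenate([
    rng.exponential(2.0, 9_900),    # most requests: ~2 ms
    rng.exponential(200.0, 100),    # rare stragglers: ~200 ms
])

print(f"mean : {latencies.mean():8.2f} ms")
print(f"p99  : {np.percentile(latencies, 99):8.2f} ms")
print(f"p99.9: {np.percentile(latencies, 99.9):8.2f} ms")
```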

E-commerce search
Modern Internet service workloads are notoriously complex, featuring industry-scale architectures fueled by machine learning algorithms. As a joint work with Alibaba, we release an end-to-end application benchmark, E-commerce Search, that mimics these complex modern Internet service workloads.

(2) AIoT Bench: Benchmarking Mobile and Embedded Device Intelligence

Thanks to increasing amounts of data and compute resources, deep learning has achieved great success in many domains. Recently, researchers and engineers have worked to bring intelligent algorithms to mobile and embedded devices, e.g., smartphones, self-driving cars, and smart homes. On one hand, neural networks are made more lightweight to fit mobile and embedded devices, by using simpler architectures or by quantizing, pruning, and compressing the networks. On the other hand, mobile and embedded devices provide additional hardware acceleration, such as GPUs or NPUs, to support AI applications. As AI applications on mobile and embedded devices receive more and more attention, benchmarking the AI ability of these devices becomes an urgent problem.
AIoT Bench is a comprehensive benchmark suite for evaluating the AI ability of mobile and embedded devices. Our benchmark 1) covers different application domains, e.g., image recognition, speech recognition, and natural language processing; 2) covers different platforms, including Android devices and the Raspberry Pi; 3) covers different development tools, including TensorFlow and Caffe2; and 4) offers both end-to-end application workloads and micro workloads.

The workloads in AIoT Bench are implemented using both TensorFlow Lite and Caffe2, on Android as well as the Raspberry Pi. Only the prediction procedure is included, since training is usually carried out in datacenters.
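
A minimal sketch of such prediction-only execution with the TensorFlow Lite interpreter follows; the model file name is a placeholder for any converted model.

```python
# Prediction-only inference with TensorFlow Lite; the .tflite file name
# is a placeholder for a converted model (e.g., MobileNet).
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="mobilenet.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.random.rand(*inp["shape"]).astype(inp["dtype"])  # dummy input
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])
print("predicted label:", int(np.argmax(scores)))
```
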
Image classification workload. This is an end-to-end application workload in the vision domain, which takes an image as input and outputs the image label. The model we use for image classification is MobileNet, a lightweight convolutional network designed for mobile and embedded devices.
Speech recognition workload. This is an end-to-end application workload in the speech domain, which takes words and phrases in a spoken language as input and converts them to text. The model we use is DeepSpeech2, which consists of 2 convolutional layers, 5 bidirectional RNN layers, and a fully connected layer.
Transformer translation workload. This is an end-to-end application workload in the NLP domain, which takes text in one language as input and translates it into another language. The model we use is the Transformer translation model, which solves sequence-to-sequence problems using attention mechanisms, without the recurrent connections used in traditional neural seq2seq models.
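
The attention mechanism at the heart of the Transformer can be sketched in a few lines; this is a simplified single-head version with no masking or learned projections.

```python
# Simplified single-head scaled dot-product attention: each output is a
# weighted sum of values, with weights from query-key similarity.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # query-key similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over keys
    return w @ V

x = np.random.rand(5, 8)         # 5 tokens, dimension 8
print(attention(x, x, x).shape)  # self-attention -> (5, 8)
```
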
Micro workloads. Our benchmark also provides micro workloads, the basic operations that compose different networks. In detail, the micro workloads include the convolution operation, pointwise convolution, depthwise convolution, matrix multiply, pointwise add, ReLU activation, sigmoid activation, max pooling, and average pooling.
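
Among these micro workloads, the depthwise and pointwise convolutions form the factored pair that makes MobileNet lightweight; a TensorFlow sketch with illustrative shapes:

```python
# Depthwise + pointwise convolution (MobileNet's building block and two
# of the micro workloads); shapes are illustrative.
import tensorflow as tf

x = tf.random.normal([1, 112, 112, 32])

# Depthwise: one 3x3 filter per channel, no cross-channel mixing.
dw_kernel = tf.random.normal([3, 3, 32, 1])
dw = tf.nn.depthwise_conv2d(x, dw_kernel, strides=[1, 1, 1, 1],
                            padding="SAME")

# Pointwise: 1x1 convolution mixes channels (32 -> 64).
pw_kernel = tf.random.normal([1, 1, 32, 64])
pw = tf.nn.conv2d(dw, pw_kernel, strides=1, padding="SAME")
print(pw.shape)   # (1, 112, 112, 64)
```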

(3) HPC AI500: A Benchmark Suite for HPC AI Systems

In recent years, with the trend of applying deep learning (DL) in scientific computing, physical simulation is no longer the only class of problems to be solved in the HPC community. The unique characteristics of emerging scientific DL workloads raise great challenges in benchmarking, and thus the community needs a new yardstick for evaluating future HPC systems.
Consequently, we propose HPC AI500, a benchmark suite for evaluating HPC systems that run scientific DL workloads. Each workload in HPC AI500 is based on real scientific DL applications and covers the most representative scientific fields. Currently, we choose 14 scientific DL benchmarks, characterized by their application scenarios, datasets, and software stacks. Furthermore, we propose a set of metrics for comprehensively evaluating HPC systems, considering accuracy and performance as well as power and cost. In addition, we provide a scalable reference implementation of HPC AI500.

Component Benchmarks. Object detection, image recognition, and image generation are the most representative tasks in modern scientific DL, so we choose the following state-of-the-art models as the HPC AI500 component benchmarks.
1) Faster R-CNN. This benchmark targets real-time object detection. Unlike previous object detection models, it replaces selective search with a region proposal network that achieves nearly cost-free region proposals. Furthermore, Faster R-CNN uses an advanced CNN model as its base network for extracting features, and it is the foundation of the first-place winning entries in ILSVRC'15 (ImageNet Large Scale Visual Recognition Challenge).
2) ResNet. This benchmark is a milestone in image recognition, marking the point where AI surpassed human-level accuracy in identifying images. It addresses the degradation problem: in very deep neural networks, the gradient gradually vanishes during propagation, leading to poor performance. Thanks to the residual idea of ResNet, researchers successfully built a 152-layer deep CNN, and this ultra-deep model won all the awards in ILSVRC'15.
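
The residual shortcut that counters the degradation problem can be sketched as a basic block; this simplified PyTorch version omits downsampling and striding.

```python
# Simplified ResNet basic block: the identity shortcut lets signals and
# gradients bypass the convolutions, countering degradation.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut: add the input back

print(BasicBlock(64)(torch.randn(1, 64, 56, 56)).shape)
```
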
3) DCGAN. This benchmark is one of the most popular and successful network architectures for GANs. Its fundamental idea is to replace fully connected layers with convolutions and to use transposed convolution for upsampling. The proposal of DCGAN helped bridge the gap between CNNs for supervised learning and unsupervised learning.
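
The transposed-convolution upsampling idea can be sketched as a compressed generator; layer sizes here are smaller than in the DCGAN paper.

```python
# Compressed DCGAN-style generator: transposed convolutions upsample a
# noise vector into an image. Layer sizes simplified vs. the paper.
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.ConvTranspose2d(100, 128, 4, 1, 0, bias=False),  # 1x1 -> 4x4
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),   # 4x4 -> 8x8
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),     # 8x8 -> 16x16
    nn.Tanh(),                                          # pixels in [-1, 1]
)

noise = torch.randn(1, 100, 1, 1)
print(generator(noise).shape)   # torch.Size([1, 3, 16, 16])
```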

Micro Benchmarks. We choose the following primary operators in CNNs as the HPC AI500 micro benchmarks.
1) Convolution. In mathematics, convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other. In a CNN, convolution is the operation occupying the largest proportion of computation: it multiply-accumulates the input matrix with the convolution kernel and produces feature maps. Many convolution kernels, distributed across different layers, are responsible for learning features at different levels.
2) Fully connected. The fully-connected layer can be seen as the classifier of a CNN and is essentially a matrix multiplication. It is also the cause of the explosion in CNN parameters. For example, in AlexNet, the fully-connected layers hold about 59 million training parameters, accounting for 94 percent of the total.
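
The 59-million figure follows from a quick calculation over AlexNet's three fully-connected layers (weights only, biases ignored):

```python
# Weight counts of AlexNet's fully-connected layers.
fc6 = 6 * 6 * 256 * 4096   # flattened 6x6x256 conv output -> 4096 units
fc7 = 4096 * 4096
fc8 = 4096 * 1000          # 1000 ImageNet classes
print(f"{(fc6 + fc7 + fc8) / 1e6:.1f}M")   # ~58.6M, ~94% of all parameters
```
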
3) Pooling. Pooling is a sample-based discretization process. In a CNN, the objective of pooling is to down-sample the inputs (e.g., feature maps), which reduces dimensionality and the number of training parameters; it also enhances the robustness of the whole network. The commonly used pooling operations include max pooling and average pooling.

(4) Edge AIBench: Comprehensive End-to-end Edge Computing Benchmarking

The rapid growth in the number of client-side devices brings great challenges to computing power, data storage, and networking, considering the variety and quantity of these devices. The traditional way to address these problems is the cloud computing paradigm, and with the development of 5G network technology, the Internet of Things (IoT) has become another solution. Moreover, edge computing, which combines the cloud computing and IoT frameworks, has shown rapid development in recent years. In edge computing scenarios, the distribution of data and the collaboration of workloads across different layers raise serious performance, security, and privacy concerns. Edge computing benchmarking must therefore take an end-to-end view, considering all three layers: client-side devices, the edge computing layer, and cloud servers.
Edge AIBench is a benchmark suite for end-to-end edge computing that includes four typical application scenarios: ICU Patient Monitor, Surveillance Camera, Smart Home, and Autonomous Vehicle, which together reflect the complexity of edge computing AI scenarios. In addition, Edge AIBench provides an end-to-end application benchmarking framework, consisting of training, validation, inference, and data collection stages built on a general three-layer edge computing framework. Table 3 shows the component benchmarks of Edge AIBench.

ICU Patient Monitor. The ICU is the treatment place for critical patients, so immediacy is significant in the ICU patient monitoring scenario: doctors must be notified of a patient's status as soon as possible. The dataset we use is MIMIC-III, which provides many kinds of patient data, such as vital signs and fluid balance. We choose heart failure prediction and endpoint prediction as the AI benchmarks.
Surveillance Camera. There are many surveillance cameras all over the world, and they produce a large quantity of video data at all times. If all of this data were transmitted to cloud servers, the required network bandwidth would be very high. Therefore, this scenario focuses on data preprocessing and data compression on the edge.
Smart Home. A smart home includes many smart devices, such as automatic controllers, alarm systems, and audio equipment. Thus, the uniqueness of the smart home scenario lies in its diverse edge devices and heterogeneous data. We choose two AI applications as the component benchmarks: speech recognition and face recognition. These two components involve heterogeneous data and different collecting devices: both collect data on client-side devices (e.g., a camera or a smartphone), infer on the edge computing layer, and train on the cloud server.
Autonomous Vehicle. The uniqueness of the autonomous vehicle scenario is its high demand for reliability: the vehicle must take correct actions even without human intervention. This feature represents the demands of a class of edge computing AI scenarios. The automatic control system analyzes the current road conditions and reacts immediately. We choose road sign recognition as the component benchmark.

A Federated Learning Framework Testbed. We have developed an edge computing AI testbed to support researchers and common users, publicly available at http://www.benchcouncil.org/testbed.html. Security and privacy have become significant concerns in the age of big data, and in edge computing in particular. Federated learning is a distributed, collaborative machine learning technique whose main goal is to preserve privacy. Our testbed system will incorporate the federated learning framework.
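
The core loop of federated learning, federated averaging, can be sketched as follows; the local least-squares update is a stand-in for any local training step, and this is not the testbed's implementation.

```python
# Conceptual FedAvg: clients train locally on private data and only
# model weights are averaged on the server; raw data never leaves.
import numpy as np

def local_update(w, data, lr=0.1):
    # Stand-in local step: one gradient step of linear least squares.
    X, y = data
    return w - lr * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 4)), rng.normal(size=50))
           for _ in range(3)]            # three clients' private data
w_global = np.zeros(4)

for _ in range(10):                      # communication rounds
    w_locals = [local_update(w_global, d) for d in clients]
    w_global = np.mean(w_locals, axis=0) # server averages weights

print("global weights:", np.round(w_global, 3))
```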

Table 3. The summary of Edge AIBench.

| End-to-end Application Scenario | Component Benchmark      | Cloud Server           | Edge Computing Layer | Client-Side Device |
|---------------------------------|--------------------------|------------------------|----------------------|--------------------|
| ICU Patient Monitor             | Heart Failure Prediction | Train                  | Infer, Alarm         | Generate Data      |
| ICU Patient Monitor             | Endpoint Prediction      | Train                  | Infer                | Generate Data      |
| Surveillance Camera             | Person Re-Identification | Decompress Data, Train | Compress Data, Infer | Generate Data      |
| Smart Home                      | Speech Recognition       | Train                  | Infer                | Generate Data      |
| Smart Home                      | Face Recognition         | Train                  | Infer                | Generate Data      |
| Autonomous Vehicle              | Road Sign Recognition    | Train                  | Infer                | Generate Data      |

Contributors

Prof. Jianfeng Zhan, ICT, Chinese Academy of Sciences, and BenchCouncil
Dr. Wanling Gao, ICT, Chinese Academy of Sciences
Dr. Lei Wang, ICT, Chinese Academy of Sciences    
Chunjie Luo, ICT, Chinese Academy of Sciences
Tianshu Hao, ICT, Chinese Academy of Sciences
Zihan Jiang, ICT, Chinese Academy of Sciences
Yunyou Huang, ICT, Chinese Academy of Sciences
Dr. Chen Zheng, ICT, Chinese Academy of Sciences, and BenchCouncil    
Dr. Zheng Cao, Alibaba     
Hainan Ye, Beijing Academy of Frontier Sciences and BenchCouncil     
Dr. Zhen Jia, Princeton University and BenchCouncil
Daoyi Zheng, Baidu     
Shujie Zhang, Huawei     
Haoning Tang, Tencent     
Dr. Yingjie Shi
Zijian Ming, Tencent     
Yuanqing Guo, Sohu    
Yongqiang He, Dropbox
Kent Zhan, Tencent (Previously), WUBA(Currently)    
Xiaona Li, Baidu    
Bizhu Qiu, Yahoo!
Qiang Yang, BAFST    
Jingwei Li, BAFST    
Dr. Xinhui Tian, ICT, CAS    
Dr. Gang Lu, BAFST
Xinlong Lin, BAFST    
Rui Ren, ICT, CAS    
Dr. Rui Han, ICT, CAS    

Numbers

Benchmarking results will be available soon.

License

AIBench is available to researchers interested in AI. AIBench itself is open-source under the Apache License, Version 2.0; please use all files in compliance with the License. Software components of AIBench are all available as open-source software and are governed by their own licensing terms. Researchers intending to use AIBench are required to fully understand and abide by the licensing terms of the various components. For software developed externally (not by the AIBench group):

  • Redistributions of source code must comply with the license and notice disclaimers.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimers in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.