Component benchmarks

Table 1: The Component Benchmarks Specifications.

| Component | Scenario | Reference Model | Dataset | Scientific Field |
| --- | --- | --- | --- | --- |
| Image Recognition | Identifying particle signal | ResNet-50 | HEP Dataset | High Energy Physics |
| Image Generation | Generating cosmological images | DCGAN | Cos Dataset | Cosmology |
| Object Detection | Extreme weather analysis | Faster-RCNN | EW Dataset | Climate Analysis |
| Sequence Prediction | Predicting spectrum of peptides | BiLSTM | pDeep Dataset | Computational Biology |

Scientific AI Scenarios

Identifying Particle Signal

Particle collision is the most important experimental approach in High Energy Physics (HEP). Detecting the signals of new particles is the major goal of experimental HEP. Today's HEP experimental facilities, such as the LHC, create particle signals across hundreds of millions of channels at a high data rate. The signal data from the different channels in each collision are usually represented as a sparse 2D image, called a jet-image. Accurately classifying these jet-images is the key to finding signals of new particles.

Generating Cosmological Images

Cosmology is a branch of astronomy concerned with the study of the origin and evolution of the universe, from the Big Bang to today and on into the future. In the 21st century, the most fundamental problem in cosmology is the nature of dark energy. This mysterious energy greatly affects the distribution of matter in the universe, which is described by cosmological parameters. To predict these parameters, scientists must prepare massive numbers of cosmological images. Generating high-quality cosmological images is therefore important, both to guarantee high-fidelity numerical simulations and to avoid the use of expensive instruments.

Extreme Weather Analysis

Extreme weather poses a great challenge to human society. It causes severe damage to people's health and to the economy every year. For instance, the heatwaves in 2018 caused over 1600 deaths according to the UN report, and the landfalls of Hurricanes Florence and Michael caused about 40 billion dollars' worth of damage to the US economy. In this context, understanding the life cycle of extreme weather, and even predicting its future trend, has become a significant scientific goal. Achieving this goal requires accurately identifying extreme weather patterns through the analysis of massive climate data, in order to gain insight into climate change. The reference implementation is now available; it is the benchmark for the BenchCouncil 2019 System award.

Predicting Spectrum of Peptides

In tandem mass spectrometry (MS/MS)-based proteomics, search engines rely on comparison between an experimental MS/MS spectrum and the theoretical spectra of the candidate peptides. Hence, accurate prediction of the theoretical spectra of peptides appears to be particularly important.

Reference Models


ResNet is a milestone in image recognition, marking AI's ability to classify images more accurately than humans on benchmark data. It addresses the degradation problem: as a plain network gets deeper, its accuracy saturates and then degrades, so very deep networks perform poorly. ResNet counters this with shortcut connections that let each block learn a residual that is added to its input. With this idea, researchers successfully built a 152-layer CNN. This ultra-deep model won first place in several ILSVRC'15 tasks, including image classification.
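ResNet's core idea is a shortcut connection that adds a block's input to its output, y = F(x) + x. The following is a minimal sketch of that idea only, not the ResNet-50 reference model: `residual_block` and `zero_transform` are hypothetical names, and plain Python lists stand in for feature maps.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return transform(x) + x, i.e. the block output with an identity shortcut."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input shape")
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x: Vector) -> Vector:
    """Toy stand-in for the stacked layers F(x); here it contributes nothing."""
    return [0.0 for _ in x]


# With a zero residual the block reduces to the identity mapping, so a deeper
# network can at least reproduce what the shallower one computes.
features = [1.0, -2.0, 3.5]
output = residual_block(features, zero_transform)
```

In this toy case `output` equals `features`, which illustrates why extra residual blocks need not hurt a network that is already good enough.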


DCGAN is one of the most popular and successful network architectures for GANs. Its fundamental idea is to replace fully connected layers with convolutions and to use transposed convolution for upsampling. The proposal of DCGAN helps bridge the gap between CNNs for supervised learning and unsupervised learning.
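Upsampling by transposed convolution can be illustrated in one dimension. This is a hedged sketch rather than DCGAN's actual generator: `transposed_conv1d` is a hypothetical helper, and the kernel and stride values are chosen only for illustration.

```python
from typing import List


def transposed_conv1d(x: List[float], kernel: List[float], stride: int = 2) -> List[float]:
    """Upsample x: each input value scatters a scaled copy of the kernel into the output."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            # Overlapping contributions (when stride < len(kernel)) are summed.
            out[i * stride + k] += value * weight
    return out


# Three input values become six output values with stride 2.
upsampled = transposed_conv1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=2)
```

The output length follows `(len(x) - 1) * stride + len(kernel)`, which is how a generator can grow a small latent map into a full-size image layer by layer.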


Faster-RCNN targets real-time object detection. Unlike the previous object detection models (R-CNN and Fast R-CNN), it replaces selective search with a region proposal network that achieves nearly cost-free region proposals. Furthermore, Faster-RCNN uses an advanced CNN model as its base network for extracting features, and it is the foundation of the 1st-place winning entries in ILSVRC'15 (ImageNet Large Scale Visual Recognition Challenge).


BiLSTM (Bi-directional Long Short-Term Memory) is composed of a forward LSTM and a backward LSTM. It is well suited to sequence tagging tasks in which each element depends on both the preceding and the following context, so it is often used to model context information in NLP. BiLSTM can be seen as an improved version of the LSTM that better captures bi-directional semantic dependencies.
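The bi-directional structure can be sketched independently of the LSTM cell itself. In the example below, `toy_step` is a deliberately simplified stand-in for an LSTM cell (an assumption made for brevity), and `run_direction` and `bidirectional` are hypothetical helper names; only the forward/backward wiring is the point.

```python
from typing import Callable, List, Tuple

Step = Callable[[float, float], float]


def run_direction(seq: List[float], step: Step) -> List[float]:
    """Apply a recurrent step left to right and return the state at each position."""
    state = 0.0
    states = []
    for x in seq:
        state = step(state, x)
        states.append(state)
    return states


def bidirectional(seq: List[float], step: Step) -> List[Tuple[float, float]]:
    """Pair each position's forward state with its backward state."""
    forward = run_direction(seq, step)
    # Run over the reversed sequence, then flip back so positions line up.
    backward = run_direction(seq[::-1], step)[::-1]
    return list(zip(forward, backward))


def toy_step(state: float, x: float) -> float:
    """Simplified recurrent update standing in for an LSTM cell."""
    return 0.5 * state + x


context = bidirectional([1.0, 2.0, 4.0], toy_step)
```

Each position ends up with one state summarizing everything before it and one summarizing everything after it, which is the context a sequence predictor then consumes.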

The Selected Dataset

The HEP Dataset

The HEP Dataset is divided into two classes: the RPV-SUSY signal and the most prevalent background. The training data set is composed of around 400k jet-images. Each jet-image is represented as a 64 × 64 sparse matrix with 3 channels. Validation and test data are also provided. All the data are generated using the Pythia event generator interfaced to the Delphes fast detector simulation.

The Cos Dataset

The Cos Dataset is intended for generating images of galaxies. It is based on dark matter N-body simulations produced using the MUSIC and pycola packages. Each simulation covers a volume of 512h^{-1}Mpc^3 and contains 512^3 dark matter particles.

The EW Dataset

The ExtremeWeather Dataset is made up of 26 years of climate data. The data for each year are available as one HDF5 file. Each HDF5 file contains two data sets: images and boxes. The images data set has 1460 dense example images (4 per day, 365 days per year) with 16 channels. Each channel is 768 × 1152, corresponding to one measurement per 25 square km on Earth. The boxes data set records the coordinates of the four types of extreme weather events in the corresponding images: tropical depression, tropical cyclone, extratropical cyclone and atmospheric river. This dataset is now available, see The Extreme Weather Dataset.
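The figures above imply a sizeable volume of data per yearly file. The arithmetic below is a rough sketch: the 4-byte (float32) value size and the absence of compression are assumptions not stated in the dataset description, and all constant names are illustrative.

```python
IMAGES_PER_DAY = 4
DAYS_PER_YEAR = 365
CHANNELS = 16
HEIGHT, WIDTH = 768, 1152
YEARS = 26
BYTES_PER_VALUE = 4  # assumption: values stored as uncompressed float32

images_per_year = IMAGES_PER_DAY * DAYS_PER_YEAR        # matches the 1460 quoted above
values_per_image = CHANNELS * HEIGHT * WIDTH            # one value per channel per grid cell
bytes_per_year = images_per_year * values_per_image * BYTES_PER_VALUE
gib_per_year = bytes_per_year / 2**30                   # size of one yearly images data set
total_images = images_per_year * YEARS                  # images across all 26 years
```

Under these assumptions one yearly images data set comes to roughly 77 GiB, which is why data loading is a significant part of scaling this benchmark.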

The pDeep Dataset

The pDeep Dataset contains about 4,000,000 high-quality, high-resolution MS/MS spectra collected from ProteomeTools and other reliable proteomic data sets, including HCD, ETD, and EThcD spectra. Please see this link for further information.


At present, time-to-accuracy is the most widely accepted metric (e.g., in DAWNBench and MLPerf). For a comprehensive evaluation, both the training accuracy and the validation accuracy are provided. The former measures how well the model fits the training data, and the latter measures the generalization ability of the model. The target accuracy threshold is set according to the requirements of the corresponding application domain, so each application domain needs to define its own target accuracy. A paper has analyzed the soundness of this metric. In addition, cost-to-accuracy and power-to-accuracy are provided to measure the money and power spent on training the model to the target accuracy.
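As an illustration of how these metrics could be derived from a training log, consider the sketch below. The function names, the log format of (elapsed seconds, validation accuracy) pairs, and the hourly price are all hypothetical; this is not the benchmark's own measurement tooling.

```python
from typing import Iterable, Optional, Tuple


def time_to_accuracy(log: Iterable[Tuple[float, float]], target: float) -> Optional[float]:
    """Return the elapsed seconds at which accuracy first reaches target, else None."""
    for elapsed_seconds, accuracy in log:
        if accuracy >= target:
            return elapsed_seconds
    return None


def cost_to_accuracy(seconds: float, price_per_hour: float) -> float:
    """Convert a time-to-accuracy into money, given an assumed hourly price."""
    return seconds / 3600.0 * price_per_hour


# Made-up log entries: (elapsed seconds, validation accuracy) per epoch.
training_log = [(600.0, 0.52), (1200.0, 0.71), (1800.0, 0.78), (2400.0, 0.80)]
tta = time_to_accuracy(training_log, target=0.75)
cost = cost_to_accuracy(tta, price_per_hour=3.0) if tta is not None else None
```

A run that never reaches the target returns `None` rather than a time, which keeps failed runs from being mistaken for fast ones.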

Reference Implementation

Based on the extreme weather dataset, we provide a scalable reference implementation of HPC AI500 component benchmarks, see Download for details.

Micro benchmarks

We choose the following primary CNN operators as our micro benchmarks.


In mathematics, convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other. In a CNN, convolution accounts for the largest share of computation: it is the multiply-accumulate of the input matrix with the convolution kernel, which produces feature maps. Many convolution kernels are distributed across the layers, and they are responsible for learning features at different levels.
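The multiply-accumulate pattern can be written out directly. The sketch below assumes a single channel, stride 1, and no padding; like most CNN frameworks, it slides the kernel without flipping it (strictly speaking, cross-correlation). `conv2d` is a hypothetical helper, not the micro benchmark's implementation.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Single-channel, stride-1, unpadded convolution as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            # Multiply-accumulate over the window covered by the kernel.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
    return out


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0], [0.0, -1.0]]
feature_map = conv2d(image, kernel)
```

Each output element costs `kh * kw` multiply-accumulates, which is why convolution dominates the operation count of a CNN.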


The fully-connected layer can be seen as the classifier of a CNN and is essentially a matrix multiplication. It is also the cause of the explosion of CNN parameters. For example, in AlexNet, the number of training parameters in the fully-connected layers reaches about 59 million and accounts for 94 percent of the total.
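The parameter count quoted for AlexNet can be checked with a short calculation. The layer shapes below (256×6×6 → 4096 → 4096 → 1000) are the commonly cited AlexNet classifier sizes and are an assumption here, as are the helper names `fc_param_count` and `fc_forward`.

```python
from typing import List


def fc_param_count(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


def fc_forward(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Compute y = W x + b, with W stored as one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


# Assumed AlexNet classifier shapes: 256*6*6 -> 4096 -> 4096 -> 1000.
ALEXNET_FC_SHAPES = [(256 * 6 * 6, 4096), (4096, 4096), (4096, 1000)]
alexnet_fc_params = sum(fc_param_count(i, o) for i, o in ALEXNET_FC_SHAPES)

# Tiny forward pass: 2 inputs, 3 outputs.
y = fc_forward([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 1.0])
```

Under these assumed shapes the count comes to 58,631,144, consistent with the figure of about 59 million.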


Pooling is a sample-based discretization process. In a CNN, the objective of pooling is to down-sample the inputs (e.g., feature maps), which reduces the dimensionality and the number of training parameters. In addition, it enhances the robustness of the whole network. The commonly used pooling operations include max-pooling and average-pooling.
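Both pooling variants share the same windowing logic and differ only in the reduction applied to each window. The sketch below assumes non-overlapping square windows (stride equal to window size); `pool2d` and `mean` are hypothetical helper names.

```python
from typing import Callable, List

Matrix = List[List[float]]


def pool2d(feature_map: Matrix, size: int, reduce: Callable[[List[float]], float]) -> Matrix:
    """Down-sample with non-overlapping size x size windows, reduced by `reduce`."""
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


fmap = [[1.0, 3.0, 2.0, 4.0],
        [5.0, 7.0, 6.0, 8.0],
        [9.0, 11.0, 10.0, 12.0],
        [13.0, 15.0, 14.0, 16.0]]
max_pooled = pool2d(fmap, 2, max)    # keeps the strongest response per window
avg_pooled = pool2d(fmap, 2, mean)   # keeps the average response per window
```

A 4 × 4 map shrinks to 2 × 2 in both cases, so the layers that follow see a quarter of the values.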


The metrics of the micro benchmarks are simple, since we only measure performance without considering accuracy. We adopt FLOPS and images per second (images/s) as the two main metrics. We also consider power- and cost-related metrics.
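Both metrics reduce to dividing an amount of work by the wall-clock time it took. The following is a minimal sketch of that bookkeeping; `images_per_second`, `achieved_flops`, and the dummy workload are hypothetical, and a real measurement would also need warm-up runs and device synchronization.

```python
import time
from typing import Callable


def images_per_second(run_batch: Callable[[], object], batch_size: int, num_batches: int) -> float:
    """Time num_batches calls of run_batch and return the image throughput."""
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * num_batches / elapsed


def achieved_flops(flop_count: float, seconds: float) -> float:
    """Floating-point operations per second for a kernel with a known operation count."""
    return flop_count / seconds


def dummy_batch() -> int:
    """Placeholder workload standing in for one operator invocation on a batch."""
    return sum(range(10_000))


throughput = images_per_second(dummy_batch, batch_size=32, num_batches=10)
flops = achieved_flops(flop_count=2e9, seconds=0.5)
```

The operation count passed to `achieved_flops` must come from the operator's definition (for example, the multiply-accumulate count of a convolution), not from the timer.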