In recent years, with the growing adoption of deep learning (DL) in scientific computing, physical simulation is no longer the only class of problems solved by the HPC community. The unique characteristics of emerging scientific DL workloads pose great challenges for benchmarking, and the community therefore needs a new yardstick for evaluating future HPC systems.

Consequently, we propose HPC AI500---a benchmark suite for evaluating HPC systems that run scientific DL workloads. Each workload in HPC AI500 is based on a real scientific DL application, and together they cover the most representative scientific fields. Currently, we choose 14 scientific DL benchmarks, considering application scenarios, datasets, and software stacks. Furthermore, we propose a set of metrics for comprehensively evaluating HPC systems, covering accuracy and performance as well as power and cost. In addition, we provide a scalable reference implementation of HPC AI500.

**Component Benchmarks.** Since object detection, image recognition, and image generation are the most representative DL tasks in modern scientific DL, we choose the following state-of-the-art models as the HPC AI500 component benchmarks.

1) Faster-RCNN. This benchmark targets real-time object detection. Unlike previous object detection models, it replaces selective search with a region proposal network, achieving nearly cost-free region proposals. Furthermore, Faster-RCNN uses an advanced CNN as its base network for extracting features and is the foundation of the 1st-place winning entries in ILSVRC'15 (ImageNet Large Scale Visual Recognition Competition).

2) ResNet. This benchmark is a milestone in image recognition, marking the point at which AI surpassed human-level accuracy in identifying images. It addresses the degradation problem: in very deep neural networks, the gradient gradually vanishes during back-propagation, leading to poor performance. Using the residual connections of ResNet, researchers successfully built a 152-layer deep CNN. This ultra-deep model won all the awards in ILSVRC'15.
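The residual connection at the heart of ResNet can be sketched in a few lines: the block learns a residual function F(x) and adds the input back through an identity shortcut, so gradients can flow through the shortcut even when the learned path saturates. This is a minimal NumPy sketch with a single weight matrix standing in for the block's convolutional layers; the shapes and activation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, weight):
    """Identity-shortcut residual block: output = activation(F(x) + x),
    where F(x) here is a single weighted layer standing in for the
    block's stacked convolutions."""
    fx = relu(x @ weight)   # the learned residual F(x)
    return relu(fx + x)     # shortcut adds the input back

x = np.ones((1, 4))
W = np.zeros((4, 4))  # with zero weights the block reduces to the identity
y = residual_block(x, W)
print(y)  # [[1. 1. 1. 1.]] -- the input passes through unchanged
```

With zero weights the block is exactly the identity map, which illustrates why very deep stacks of such blocks are easier to optimize: a layer can "do nothing" by default and only learn a residual correction.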

3) DCGAN. This benchmark is one of the most popular and successful network architectures for GANs. Its fundamental idea is replacing fully-connected layers with convolutions and using transposed convolution for upsampling. The proposal of DCGAN helped bridge the gap between CNNs for supervised learning and unsupervised learning.
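The transposed convolution that DCGAN uses for upsampling can be illustrated directly: each input pixel scatters a weighted copy of the kernel into a larger output map. The naive loop below is a sketch of the operation, not DCGAN's actual generator; the stride and kernel size are illustrative.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Transposed convolution: each input value x[i, j] adds a scaled
    copy of the kernel into the output at offset (i*stride, j*stride),
    up-sampling the feature map."""
    kh, kw = kernel.shape
    oh = (x.shape[0] - 1) * stride + kh
    ow = (x.shape[1] - 1) * stride + kw
    out = np.zeros((oh, ow))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * kernel
    return out

x = np.ones((2, 2))       # a small 2x2 feature map
k = np.ones((2, 2))       # illustrative kernel
y = transposed_conv2d(x, k)
print(y.shape)  # (4, 4): the 2x2 input is up-sampled to 4x4
```

In a DCGAN generator, a stack of such layers progressively grows a low-resolution latent map into a full-resolution image.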

**Micro Benchmarks.** We choose the following primary operators in CNN as HPC AI500 micro benchmarks.

1) Convolution. In mathematics, convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other. In a CNN, convolution is the operation occupying the largest proportion of computation: it multiply-accumulates the input matrix with a convolution kernel to produce feature maps. Many convolution kernels, distributed across different layers, are responsible for learning features at different levels.
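The multiply-accumulate pattern described above can be sketched as a naive 2D convolution (implemented, as in most DL frameworks, as cross-correlation); the input and kernel sizes here are illustrative assumptions.

```python
import numpy as np

def conv2d(x, kernel):
    """Naive 2D convolution (cross-correlation, as used in CNNs):
    slide the kernel over the input and multiply-accumulate each
    window into one output element of the feature map."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * kernel)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input
k = np.ones((2, 2))                           # toy 2x2 kernel
fmap = conv2d(x, k)
print(fmap.shape)  # (3, 3) -- a "valid" convolution shrinks the map
```

Optimized HPC implementations replace these loops with im2col plus GEMM or with direct/Winograd/FFT kernels, but the arithmetic being benchmarked is exactly this multiply-accumulate.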

2) Fully-connected. The fully-connected layer can be seen as the classifier of a CNN and is essentially matrix multiplication. It is also the cause of the explosion of CNN parameters. For example, in AlexNet, the fully-connected layers contain about 59 million training parameters, accounting for 94 percent of the total.
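The AlexNet figure above is easy to verify: a fully-connected layer mapping `n_in` inputs to `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below uses AlexNet's published FC dimensions (6x6x256 -> 4096 -> 4096 -> 1000).

```python
import numpy as np

# A fully-connected layer is a matrix multiplication plus a bias.
def fc(x, W, b):
    return x @ W + b

# Parameter counts of AlexNet's three fully-connected layers:
# fc6: 6*6*256 -> 4096, fc7: 4096 -> 4096, fc8: 4096 -> 1000.
dims = [(6 * 6 * 256, 4096), (4096, 4096), (4096, 1000)]
params = sum(n_in * n_out + n_out for n_in, n_out in dims)
print(params)  # 58,631,144 -- roughly the 59 million quoted above
```

The first FC layer alone (9216 x 4096 weights) contributes nearly 38 million parameters, which is why later architectures replace large FC classifiers with global average pooling.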

3) Pooling. Pooling is a sample-based discretization process. In a CNN, the objective of pooling is to down-sample the inputs (e.g., feature maps), reducing dimensionality and the number of training parameters. In addition, it enhances the robustness of the whole network. The commonly used pooling operations include max-pooling and average-pooling.
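Both pooling variants can be sketched as a reduction over non-overlapping windows; the 2x2 window size below is an illustrative (and common) choice.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: reshape the input into size x size
    windows and reduce each window with max (max-pooling) or mean
    (average-pooling), halving each spatial dimension for size=2."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    reduce = np.max if mode == "max" else np.mean
    return reduce(windows, axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))  # [[ 5.  7.] [13. 15.]]
print(pool2d(x, mode="avg"))  # [[ 2.5  4.5] [10.5 12.5]]
```

Unlike convolution and fully-connected layers, pooling has no learned parameters, so it reduces both the spatial size of feature maps and the parameter count of subsequent layers.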