AIBench: A Datacenter AI Benchmark Suite, BenchCouncil

 

Numbers

We evaluate CPUs, GPUs, and other AI accelerators using AIBench. We also evaluate mobile and IoT chips using AIoT Bench, and HPC AI systems using HPC AI 500. BenchCouncil will publish the performance numbers periodically, and more intelligent chips and accelerators will be evaluated over time. BenchCouncil welcomes everyone interested in the performance of AI systems and architectures to join the benchmarking effort and submit their results.

Methodology

We agree with DawnBench's (DawnBench Paper) choice of the time-to-accuracy metric, because some optimizations immediately improve traditional performance metrics like throughput while adversely affecting the quality of the final model, which can only be observed by running an entire training session (MLPerf latest report). Unfortunately, training to a state-of-the-art accuracy requires a large amount of execution time. However, we believe this cost cannot justify including only a few benchmarks. In fact, the execution time of other benchmark suites (e.g., HPC benchmarks, or SPEC CPU run on a simulator) is also prohibitively long. Nevertheless, the representativeness and coverage of a widely accepted benchmark suite are of paramount importance. For example, SPEC CPU 2017 contains 43 benchmarks; other examples include PARSEC 3.0 (30 benchmarks) and TPC-DS (99 queries).

AIBench therefore adopts a different strategy. We include a diverse set of benchmarks (16 problem domains, with another one, face forensics, being added). For performance ranking, however, we choose only a few representative benchmarks (fewer than MLPerf) to reduce the cost, much as the HPC Top500 ranking reports only HPL, HPCG, and Graph500 out of the 20+ representative HPC benchmarks (e.g., HPCC, NPB).
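
To make the time-to-accuracy metric concrete, here is a minimal sketch of how such a measurement can be driven. The train_one_epoch and evaluate callbacks are hypothetical placeholders rather than AIBench APIs; the point is only that the clock covers the entire training session and stops once the target accuracy is reached.

    import time

    def time_to_accuracy(model, train_loader, val_loader, target_accuracy,
                         train_one_epoch, evaluate, max_epochs=200):
        """Return the wall-clock seconds needed to reach target_accuracy.

        train_one_epoch(model, train_loader) and evaluate(model, val_loader)
        are hypothetical callbacks supplied by a benchmark implementation.
        """
        start = time.time()
        for epoch in range(max_epochs):
            train_one_epoch(model, train_loader)
            accuracy = evaluate(model, val_loader)
            if accuracy >= target_accuracy:
                return time.time() - start
        raise RuntimeError("target accuracy not reached within max_epochs")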

Metrics

Using AIBench, BenchCouncil reports the performance number (throughput), the performance & accuracy number (time-to-accuracy), and the energy consumption number (energy-to-accuracy).

  • Time-to-accuracy Number of a few representative benchmarks from AIBench.

    As training to a state-of-the-art accuracy requires a large amount of execution time, for the performance & accuracy ranking we choose only a few representative benchmarks from AIBench to reduce the cost, much as the HPC Top500 ranking reports only three benchmarks.

  • Energy-to-accuracy Number of a few representative benchmarks from AIBench.

    As training to a state-of-the-art accuracy requires a large amount of execution time, for the energy ranking we choose only a few representative benchmarks from AIBench to reduce the cost, much as the HPC Top500 ranking reports only three benchmarks.

  • Throughput Number of full benchmarks of AIBench.

    We evaluate CPUs, GPUs, and other AI accelerators using the full benchmarks of AIBench. We run each benchmark with optimized parameter settings to reach the accuracy reported in the referenced paper and report the throughput performance; a minimal measurement sketch is given after this list.
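
As a hedged illustration of how a throughput number can be obtained, the sketch below counts processed samples per second over a timed window after a warm-up phase. The run_iteration callback is a hypothetical stand-in for one training or inference step, not part of the AIBench harness.

    import time

    def measure_throughput(run_iteration, batch_size, num_iterations=500,
                           warmup_iterations=50):
        """Return samples processed per second over a timed window.

        run_iteration() is a hypothetical callback that performs one
        training (or inference) step on a single batch.
        """
        for _ in range(warmup_iterations):   # exclude start-up effects
            run_iteration()
        start = time.time()
        for _ in range(num_iterations):
            run_iteration()
        elapsed = time.time() - start
        return batch_size * num_iterations / elapsed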

For workload characterization, since the TBD paper and our previous work find that each iteration has the same computation logic and that the number of iterations has little impact on micro-architectural behaviors, we first adjust the parameters (e.g., batch size) and train each benchmark until it approaches the accuracy stated in the referenced paper; we then keep the same parameter settings and sample dozens of epochs to obtain the micro-architectural results.
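
As one possible illustration of this sampling approach (not necessarily the tooling used by AIBench), hardware counters for a sampled window of epochs could be collected with Linux perf. The script name and flag values below are assumptions made for the sketch.

    import subprocess

    # Hypothetical: train_sampled_epochs.py runs a fixed number of epochs
    # with the already-tuned batch size; perf stat aggregates the hardware
    # counters over that sampled window.
    subprocess.run([
        "perf", "stat",
        "-e", "cycles,instructions,cache-misses,branch-misses",
        "python", "train_sampled_epochs.py",
        "--epochs", "20", "--batch-size", "256",
    ], check=True)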

(1) Time-to-accuracy and Energy-to-accuracy Numbers

Available soon.
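
Although these numbers are not yet published, one way to approximate energy-to-accuracy is to poll GPU power draw during the training run and integrate it over time. The sketch below, which polls with nvidia-smi, is an assumption about a possible measurement setup rather than the official AIBench procedure; run_training_to_target is a hypothetical callback that blocks until the target accuracy is reached.

    import subprocess, threading, time

    def sample_gpu_power(samples, stop_event, interval=1.0):
        """Append (timestamp, total watts across GPUs) once per interval."""
        while not stop_event.is_set():
            out = subprocess.check_output([
                "nvidia-smi", "--query-gpu=power.draw",
                "--format=csv,noheader,nounits",
            ], text=True)
            watts = sum(float(line) for line in out.splitlines() if line.strip())
            samples.append((time.time(), watts))
            time.sleep(interval)

    def energy_to_accuracy(run_training_to_target):
        """Return approximate joules consumed until the target accuracy is hit."""
        samples, stop_event = [], threading.Event()
        sampler = threading.Thread(target=sample_gpu_power,
                                   args=(samples, stop_event))
        sampler.start()
        run_training_to_target()   # hypothetical: train until target accuracy
        stop_event.set()
        sampler.join()
        # Integrate power over time with a simple left Riemann sum.
        return sum((t2 - t1) * w1
                   for (t1, w1), (t2, _) in zip(samples, samples[1:]))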

(2) Throughput Performance Number

Currently, BenchCouncil publishes the performance numbers of intelligent chips using six representative component benchmarks from AIBench. The numbers comprehensively evaluate eight NVIDIA GPUs that cover different types, architectures, memory capacities, and price points. The detailed information on the eight GPUs is listed in the table below.

Numbers of Intelligent Chips

GPU Type              GPU Architecture    GPU Memory
Tesla V100            NVIDIA Volta        32 GB
Tesla V100            NVIDIA Volta        16 GB
GeForce RTX 2080Ti    NVIDIA Turing       11 GB
GeForce RTX 2080      NVIDIA Turing       8 GB
GeForce RTX 2070      NVIDIA Turing       8 GB
Tesla P100            NVIDIA Pascal       16 GB
Titan XP              NVIDIA Pascal       12 GB
GeForce GTX 1080Ti    NVIDIA Pascal       11 GB

We choose six benchmarks from AIBench to evaluate these eight GPUs: image classification, Image-to-Image, speech recognition, object detection, Image-to-Text, and face embedding.

  • Image classification (ResNet50 Model) assigns an input image to one of a set of thematic classes; it is a supervised learning problem in which a set of target classes is defined and a model is trained to recognize them. This benchmark uses the ResNet-50 neural network and takes ImageNet as data input.
  • Image-to-Image (CycleGAN Paper) converts an image from one representation to another, such as style conversion or season change. This benchmark uses the CycleGAN algorithm and takes the Cityscapes dataset as input.
  • Speech recognition (DeepSpeech2 Model) recognizes spoken language and transcribes it to text. This benchmark uses the DeepSpeech2 algorithm and takes the LibriSpeech dataset as input.
  • Object detection (Faster R-CNN Model) detects the objects within an image. This benchmark uses the Faster R-CNN algorithm and takes the Microsoft COCO dataset as input.
  • Image-to-Text (Neural Image Caption Model) automatically generates a description of an image. This benchmark uses the Neural Image Caption model and takes the Microsoft COCO dataset as input.
  • Face embedding (Facenet Model) transforms a facial image into a vector in an embedding space. This benchmark uses the FaceNet algorithm and takes VGGFace2 as input.

Training Performance on Single GPU

We evaluate single-GPU performance using the six benchmarks. The results show that the Tesla V100 has the best training performance on almost all workloads, delivering a 1x to 2x performance improvement over the other GPUs. The GeForce RTX 2080Ti also performs well on four workloads, coming close to the performance of the V100; considering their prices, the GeForce RTX 2080Ti offers a higher price–performance ratio. In addition, the AI workloads have high memory requirements, so GPUs with 8 GB of memory are no longer suitable for some AI workloads, such as object detection.




Training Performance on Multiple GPUs

We further evaluate training performance on multiple GPUs, mainly 2 cards and 4 cards. From the perspective of absolute multi-card performance, the Tesla V100 with 32 GB of memory is the fastest, performing up to 2 times better than the other GPU types. However, from the perspective of the speedup ratio of multiple cards over a single card, the V100 does not have the highest speedup: 1.77x on 2 cards (highest: 1.83x, P100) and 3.16x on 4 cards (highest: 3.37x, Titan XP). In addition, the Titan XP and GeForce RTX 2080Ti perform well and offer a higher price–performance ratio than the V100.
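
For reference, the speedup ratio above is multi-card throughput divided by single-card throughput, and dividing the speedup by the card count gives the scaling efficiency; the snippet below merely recomputes the efficiency from the V100 speedups quoted in this section.

    # Scaling efficiency = speedup / number of cards, using the V100
    # speedup ratios reported above.
    v100_speedups = {2: 1.77, 4: 3.16}
    for cards, speedup in v100_speedups.items():
        print(f"{cards} cards: speedup {speedup:.2f}x, "
              f"efficiency {speedup / cards:.0%}")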




Inference Performance on Single GPU

To evaluate the inference performance of intelligent chips, we also measure inference time on a single GPU card. We find that the V100 again has the best inference performance, performing up to 4 times better than the others. The inference performance of the Titan XP is close to that of the V100, with a higher price–performance ratio. Likewise, some AI workloads, such as object detection, also have high memory requirements during inference.