We evaluate the datacenter AI benchmark suite, AIBench, on multiple GPUs and the IoT AI benchmark suite, AIoT Bench, on multiple mobile phone chips. BenchCouncil will publish performance numbers periodically, and more intelligent chips and accelerators will be evaluated. We also encourage hardware providers and third-party organizations to join the benchmarking and report their numbers.

Metrics: The training metrics are the wall-clock time to train for a specified number of epochs, the wall-clock time to train a model to a target accuracy, and the energy consumed in training a model to a target accuracy.
The inference metrics are wall-clock time, accuracy, and energy consumption.
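
As a rough illustration of the time-to-accuracy training metric, the sketch below times a training loop until a target accuracy is reached. The function name `time_to_accuracy` and the `train_one_epoch` and `evaluate` callbacks are hypothetical placeholders, not part of AIBench. Counting evaluation time in the wall-clock total is also an assumption. Energy consumption is not covered, since it depends on an external power measurement.

```python
import time
from typing import Callable, Optional, Tuple


def time_to_accuracy(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target_accuracy: float,
    max_epochs: int,
) -> Tuple[Optional[float], int, float]:
    """Train until `evaluate()` reaches `target_accuracy` or `max_epochs` is hit.

    Returns (seconds_to_target, epochs_run, total_seconds).
    `seconds_to_target` is None if the target was never reached.
    """
    start = time.perf_counter()  # monotonic clock suited to interval timing
    epochs_run = 0
    for _ in range(max_epochs):
        train_one_epoch()
        epochs_run += 1
        # Assumption: accuracy is checked once after every epoch.
        if evaluate() >= target_accuracy:
            elapsed = time.perf_counter() - start
            return elapsed, epochs_run, elapsed
    return None, epochs_run, time.perf_counter() - start


if __name__ == "__main__":
    # Toy stand-in for a real model: accuracy rises by 0.25 per epoch.
    state = {"accuracy": 0.0}

    def fake_train() -> None:
        state["accuracy"] += 0.25

    def fake_eval() -> float:
        return state["accuracy"]

    seconds, epochs, total = time_to_accuracy(fake_train, fake_eval, 0.75, 10)
    print(f"reached target after {epochs} epochs in {seconds:.6f} s")
```

The same loop also yields the fixed-epoch metric: pass a target accuracy that cannot be reached and read `total_seconds` after `max_epochs` epochs.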

Numbers of Intelligent Chips

BenchCouncil has published the first round of performance numbers for intelligent chips, using six representative component benchmarks from AIBench. The numbers evaluate eight NVIDIA GPUs that differ in type, architecture, memory capacity, and price. Details of the eight GPUs are listed in the table below.

GPU Type              GPU Architecture    GPU Memory
Tesla V100            NVIDIA Volta        32 GB
Tesla V100            NVIDIA Volta        16 GB
GeForce RTX 2080Ti    NVIDIA Turing       11 GB
GeForce RTX 2080      NVIDIA Turing       8 GB
GeForce RTX 2070      NVIDIA Turing       8 GB
Tesla P100            NVIDIA Pascal       16 GB
Titan XP              NVIDIA Pascal       12 GB
GeForce GTX 1080Ti    NVIDIA Pascal       11 GB

We choose six benchmarks from AIBench to evaluate these eight GPUs: image classification, Image-to-Image, speech recognition, object detection, Image-to-Text, and face embedding.

  • Image classification identifies the thematic classes present in the input image. It is a supervised learning problem in which a set of target classes is defined and a model is trained to recognize them. This benchmark uses a ResNet neural network and takes ImageNet as input.
  • Image-to-Image converts an image from one representation to another, for example style conversion or season change. This benchmark uses the CycleGAN algorithm and takes the Cityscapes dataset as input.
  • Speech recognition recognizes spoken language and transcribes it to text. This benchmark uses the DeepSpeech2 algorithm and takes the LibriSpeech dataset as input.
  • Object detection detects the objects within an image. This benchmark uses the Faster R-CNN algorithm and takes the Microsoft COCO dataset as input.
  • Image-to-Text automatically generates a description of an image. This benchmark uses the Neural Image Caption model and takes the Microsoft COCO dataset as input.
  • Face embedding transforms a facial image into a vector in an embedding space. This benchmark uses the FaceNet algorithm and takes the VGGFace2 dataset as input.

Training Performance on Single GPU

We evaluate single-GPU performance using the six benchmarks. The results show that the Tesla V100 has the best training performance on almost all workloads, with a 1-2 times performance improvement. The GeForce RTX 2080Ti also performs well on four workloads, coming close to the V100. Considering their prices, the GeForce RTX 2080Ti offers better price-performance. In addition, AI workloads have high memory requirements, so GPUs with 8 GB of memory are no longer suitable for some AI workloads, such as object detection.
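
To make the memory observation concrete, the sketch below encodes the eight GPUs from the table above and lists those with more than 8 GB of memory. The `Gpu` type and the `gpus_with_memory_above` helper are hypothetical names introduced for illustration. The 8 GB cutoff comes from the text, but the actual memory needed by any given workload is not stated here and would have to be measured.

```python
from typing import List, NamedTuple


class Gpu(NamedTuple):
    name: str
    architecture: str
    memory_gb: int


# The eight evaluated GPUs, as listed in the table above.
GPUS: List[Gpu] = [
    Gpu("Tesla V100", "Volta", 32),
    Gpu("Tesla V100", "Volta", 16),
    Gpu("GeForce RTX 2080Ti", "Turing", 11),
    Gpu("GeForce RTX 2080", "Turing", 8),
    Gpu("GeForce RTX 2070", "Turing", 8),
    Gpu("Tesla P100", "Pascal", 16),
    Gpu("Titan XP", "Pascal", 12),
    Gpu("GeForce GTX 1080Ti", "Pascal", 11),
]


def gpus_with_memory_above(threshold_gb: int) -> List[Gpu]:
    """Return the GPUs whose memory is strictly greater than `threshold_gb`."""
    return [gpu for gpu in GPUS if gpu.memory_gb > threshold_gb]


if __name__ == "__main__":
    # GPUs that remain candidates for workloads that outgrow 8 GB cards.
    for gpu in gpus_with_memory_above(8):
        print(f"{gpu.name} ({gpu.architecture}, {gpu.memory_gb} GB)")
```

With a threshold of 8, the GeForce RTX 2080 and GeForce RTX 2070 drop out and the other six cards remain.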

Training Performance on Multiple GPUs

We further evaluate training performance on multiple GPUs, mainly 2 cards and 4 cards. In terms of absolute performance on multiple cards, the Tesla V100 with 32 GB of memory is the fastest, performing up to 2 times better than the other GPU types. In terms of the speedup ratio of multiple cards over a single card, however, the V100 does not rank highest: it reaches 1.77 times on 2 cards (the highest is 1.83, on the P100) and 3.16 times on 4 cards (the highest is 3.37, on the Titan XP). In addition, the Titan XP and GeForce RTX 2080Ti perform well and offer better price-performance than the V100.
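
The speedup figures can be related to ideal linear scaling with a small calculation. The sketch below assumes the speedup ratio is single-card training time divided by multi-card training time, which the text does not spell out. The names `speedup_ratio` and `scaling_efficiency` are hypothetical. The four speedup values are the ones quoted above; no raw timings are available, so none are shown.

```python
from typing import Dict, Tuple


def speedup_ratio(single_gpu_seconds: float, multi_gpu_seconds: float) -> float:
    """Speedup of a multi-card run over a single-card run (assumed definition)."""
    if single_gpu_seconds <= 0 or multi_gpu_seconds <= 0:
        raise ValueError("times must be positive")
    return single_gpu_seconds / multi_gpu_seconds


def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved: speedup / number of cards."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return speedup / num_gpus


# Speedup ratios quoted in the text, keyed by (GPU, number of cards).
REPORTED_SPEEDUPS: Dict[Tuple[str, int], float] = {
    ("V100", 2): 1.77,
    ("P100", 2): 1.83,
    ("V100", 4): 3.16,
    ("Titan XP", 4): 3.37,
}

if __name__ == "__main__":
    for (gpu, cards), speedup in REPORTED_SPEEDUPS.items():
        efficiency = scaling_efficiency(speedup, cards)
        print(f"{gpu} x{cards}: speedup {speedup:.2f}, efficiency {efficiency:.3f}")
```

Under this definition the V100 reaches an efficiency of 0.885 on 2 cards (1.77 / 2) and 0.79 on 4 cards (3.16 / 4), so the efficiency falls as cards are added.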

Inference Performance on Single GPU

To evaluate the inference performance of intelligent chips, we also measure inference time on a single GPU card. The V100 again has the best inference performance, performing up to 4 times better than the others. The inference performance of the Titan XP is close to that of the V100, and it offers better price-performance. Likewise, inference for some AI workloads, such as object detection, also has high memory requirements.
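
One way to read the price-performance comparisons is as performance per unit of price, with performance taken as the inverse of wall-clock time. The sketch below follows that reading. It is an assumption rather than a definition given by BenchCouncil, and the helper name `price_performance`, the card labels, and all numbers in the example are hypothetical placeholders, not measured results or real prices.

```python
def price_performance(time_seconds: float, price: float) -> float:
    """Performance per unit price, with performance defined as 1 / time.

    A higher value means more performance for the money.
    """
    if time_seconds <= 0 or price <= 0:
        raise ValueError("time_seconds and price must be positive")
    return (1.0 / time_seconds) / price


if __name__ == "__main__":
    # Hypothetical placeholder values for illustration only.
    candidates = {
        "gpu_a": {"time_seconds": 100.0, "price": 4.0},  # faster, more expensive
        "gpu_b": {"time_seconds": 125.0, "price": 2.0},  # slower, cheaper
    }
    ranked = sorted(
        candidates.items(),
        key=lambda item: price_performance(**item[1]),
        reverse=True,
    )
    for name, values in ranked:
        print(name, price_performance(**values))
```

In this made-up example the slower card ranks first because its lower price outweighs its longer run time, which is the pattern the text describes for the Titan XP relative to the V100.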