HPC AI500: A Benchmark Suite for HPC AI Systems

 

HPC AI500 Ranking, Image Classification, Free Level, July 2, 2020




Rank | Source | VPFLOPS | Time (min) | Quality (Top-1 accuracy) | AI accelerator | Framework
1 | Fujitsu [1] | 31.41 | 1.2 | 75.1% | Tesla V100 * 2048 | MXNet
2 | Google [2] | 20.10 | 2.2 | 76.3% | TPU V3 * 1024 | TensorFlow
3 | Sony [3] | 10.02 | 3.7 | 75.0% | Tesla V100 * 2176 | NNL
4 | Tencent [4] | 6.44 | 6.6 | 76.0% | Tesla P40 * 2048 | TensorFlow
5 | Preferred Networks [5] | 2.41 | 15 | 74.9% | Tesla P100 * 1024 | Chainer
6 | Berkeley [6] | 1.95 | 20 | 75.4% | KNL * 2048 | Intel Caffe
7 | Intel [7] | 1.27 | 28 | 74.6% | KNL * 1536 | Intel Caffe
8 | IBM [8] | 0.75 | 50 | 75.0% | Tesla P100 * 256 | Caffe
9 | Facebook [9] | 0.70 | 60 | 76.3% | Tesla P100 * 1024 | Caffe2

The data (unverified) were collected from the original papers and technical reports.
Submission Contact: Please contact benchcouncil@gmail.com to submit a new benchmarking result.

Top 3 HPC AI Systems

  • Rank 1: Fujitsu. Achieves 31.41 VPFLOPS (Valid PFLOPS) and finishes Image Classification in 1.2 minutes. The hardware consists of 2048 NVIDIA Tesla V100 GPUs. They propose a novel communication algorithm that optimally schedules grouped layers and implement a CUDA kernel dedicated to calculating norms in parallel. They also leverage the Tensor Cores of the Tesla V100 through mixed-precision training.
  • Rank 2: Google. Achieves 20.10 VPFLOPS and finishes Image Classification in 2.2 minutes. The hardware consists of 1024 TPU V3 chips. They propose a 2D-mesh all-reduce for highly efficient communication and implement batch normalization in a distributed manner. They leverage BFLOAT16, the TPU-specific precision format, for mixed-precision training.
  • Rank 3: Sony. Achieves 10.02 VPFLOPS and finishes Image Classification in 3.7 minutes. The hardware consists of 2176 NVIDIA Tesla V100 GPUs. They propose a 2D-torus all-reduce for highly efficient communication (the reduction pattern is sketched after this list) and eliminate the moving average in batch normalization. They also leverage the Tensor Cores of the Tesla V100 through mixed-precision training.
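The 2D-mesh and 2D-torus all-reduce schemes mentioned above are hierarchical: gradients are reduce-scattered along one dimension of the process grid, all-reduced along the other, and then all-gathered back. The sketch below illustrates that pattern with mpi4py and NumPy. The function name and the rows/cols arguments are assumptions for illustration only; the actual submissions rely on NCCL or TPU-specific collectives rather than plain MPI.

```python
# Illustrative sketch only (assumption: mpi4py + NumPy stand in for the
# NCCL/TPU collectives actually used by the submissions). It shows the
# hierarchical 2D-torus all-reduce pattern: reduce-scatter along rows,
# all-reduce along columns, all-gather along rows.
import numpy as np
from mpi4py import MPI

def torus_allreduce_2d(grad, rows, cols):
    """Sum-reduce a 1-D float32 gradient buffer over a rows x cols process grid.

    Requires world size == rows * cols and grad.size divisible by cols.
    """
    world = MPI.COMM_WORLD
    assert world.Get_size() == rows * cols
    rank = world.Get_rank()
    row_comm = world.Split(rank // cols, rank % cols)  # processes in the same row
    col_comm = world.Split(rank % cols, rank // cols)  # processes in the same column

    chunk = np.empty(grad.size // cols, dtype=grad.dtype)
    # 1) reduce-scatter within the row: each rank ends up with one reduced chunk
    row_comm.Reduce_scatter_block(grad, chunk, op=MPI.SUM)
    # 2) all-reduce that chunk across the column
    col_comm.Allreduce(MPI.IN_PLACE, chunk, op=MPI.SUM)
    # 3) all-gather within the row to rebuild the fully reduced gradient
    row_comm.Allgather(chunk, grad)
    return grad
```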

Benchmarks

HPC AI500 Benchmarks

Problem Domain | Dataset | Target Quality | Epochs
Image Classification | ImageNet | Top-1 Accuracy = 0.763 | 90
Extreme Weather Analytics | The Extreme Weather Dataset | mAP@[IoU=0.5] = 0.35 | 50

Metrics

The primary metric is Valid FLOPS (VFLOPS), which is calculated by the following equation:

VFLOPS = FLOPS * (achieved quality / target quality)^n

Achieved quality is the actual model quality reached in the evaluation; target quality is the state-of-the-art model quality predefined in the HPC AI500 benchmark. n is a positive integer indicating the sensitivity to model quality. For Image Classification, the target quality is Top-1 accuracy = 0.763 and n is 5 by default. For Extreme Weather Analytics, the target quality is mAP@[IoU=0.5] = 0.35 and n is 10 by default.
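As a quick illustration of the formula, the short script below computes VFLOPS for both benchmarks. The measured FLOPS and achieved-quality values in it are made-up placeholders, not results from the ranking above.

```python
# Minimal sketch of the VFLOPS metric defined above.
# The measured_flops and achieved_quality values are placeholders.
def vflops(measured_flops, achieved_quality, target_quality, n):
    """VFLOPS = FLOPS * (achieved quality / target quality) ** n"""
    return measured_flops * (achieved_quality / target_quality) ** n

# Image Classification: target Top-1 accuracy 0.763, n = 5 by default
print(vflops(30.0e15, achieved_quality=0.751, target_quality=0.763, n=5))
# Extreme Weather Analytics: target mAP@[IoU=0.5] 0.35, n = 10 by default
print(vflops(5.0e15, achieved_quality=0.36, target_quality=0.35, n=10))
```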

Methodology

As shown in Figure 2, the HPC AI500 benchmarking methodology provides three benchmarking levels: the hardware level, the system level, and the free level.

  • Hardware level: users can change Layers 1 to 4. For the other layers, benchmark users can only change parallel modes in Layer 6 or tune learning-rate policies and batch-size settings in Layer 8.
  • System level: in addition to the changes allowed at the hardware level, users are allowed to re-implement the algorithms on a different or even customized AI framework (Layer 5).
  • Free level: users can change any layer from Layer 1 to Layer 8 while keeping Layer 9 intact. The dataset, target quality, and training epochs are defined in Layer 9, while the other layers are open for optimization. (A configuration-style summary of the three levels is sketched after Figure 2.)

Figure 2: HPC AI500 V2.0 Methodology.
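To make the three levels easier to compare, here is a rough configuration-style summary of which layers are open at each level. The dictionary itself is an assumption made for readability, not part of the HPC AI500 specification.

```python
# Assumed, readability-only summary of the three benchmarking levels.
# Layer numbering follows the text; Layer 9 (dataset, target quality, epochs)
# stays fixed at every level.
TUNABLE_LAYERS = {
    # Layers 1-4 freely; only the parallel mode in Layer 6 and the
    # learning-rate/batch-size settings in Layer 8 may be changed otherwise
    "hardware": {1, 2, 3, 4, 6, 8},
    # hardware level plus re-implementation on a different or customized
    # AI framework (Layer 5)
    "system": {1, 2, 3, 4, 5, 6, 8},
    # Layers 1-8 fully open
    "free": set(range(1, 9)),
}
```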

References

1. M. Yamazaki, A. Kasagi, A. Tabuchi, T. Honda, M. Miwa, N. Fukumoto, T. Tabaru, A. Ike, and K. Nakashima, "Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds," arXiv preprint arXiv:1903.12650, 2019.
2. C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, "Image classification at supercomputer scale," arXiv preprint arXiv:1811.06992, 2018.
3. Y. Tanaka and Y. Kageyama, "ImageNet/ResNet-50 training in 224 seconds."
4. X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu, "Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes," arXiv preprint arXiv:1807.11205, 2018.
5. T. Akiba, S. Suzuki, and K. Fukuda, "Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes," arXiv preprint arXiv:1711.04325, 2017.
6. Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, "ImageNet training in minutes," in Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018), New York, NY, USA: Association for Computing Machinery, 2018.
7. V. Codreanu, D. Podareanu, and V. Saletore, "Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train," arXiv preprint arXiv:1711.04291, 2017.
8. M. Cho, U. Finkler, S. Kumar, D. Kung, V. Saxena, and D. Sreedhar, "PowerAI DDL," arXiv preprint arXiv:1708.02188, 2017.
9. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.