To keep the community open, inclusive, and growing, we recommend influential benchmark and tool projects from BenchCouncil and other organizations. If you would like your project to be recommended, or would prefer that it not be, please do not hesitate to contact us.


All information, content, and materials available on this site are for general informational purposes only. Information on this website may not constitute the most up-to-date legal or other information. This website contains links to third-party websites. Such links are provided only for the convenience of the reader, user, or browser; BenchCouncil and its members do not endorse the contents of the third-party sites and shall not be held responsible for any damages, injuries, or losses that occur while using them.

Big Data

(1) BigDataBench

The latest version, BigDataBench 5.0, provides 13 representative real-world data sets and 27 big data benchmarks. The benchmarks cover six workload types (online services, offline analytics, graph analytics, data warehouse, NoSQL, and streaming) from three important application domains: Internet services (including search engines, social networks, and e-commerce), recognition sciences, and medical sciences. The benchmark suite includes micro benchmarks, each of which implements a single data motif; component benchmarks, which combine several data motifs; and end-to-end application benchmarks, which combine component benchmarks. Because data sets have a great impact on workload behavior and performance, data variety is considered across the whole spectrum of data types, including structured, semi-structured, and unstructured data. Currently, the included data sources are text, graph, table, and image data. Using real data sets as seeds, the data generators (BDGS) produce synthetic data by scaling the seed data while preserving the characteristics of the raw data.
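The scaling idea behind BDGS can be illustrated with a toy generator (hypothetical names, not BDGS code) that expands a seed text while preserving only its word-frequency distribution:

```python
import random
from collections import Counter

def scale_text(seed_text, target_words, seed=42):
    """Toy stand-in for a BDGS-style generator: produce target_words words
    whose word-frequency distribution follows the seed text."""
    freq = Counter(seed_text.split())
    vocab = list(freq)
    weights = [freq[w] for w in vocab]
    rng = random.Random(seed)
    return " ".join(rng.choices(vocab, weights=weights, k=target_words))

# Scale a six-word seed to 1000 words with the same relative frequencies.
seed_text = "big data benchmark big data suite"
synthetic = scale_text(seed_text, 1000)
```

Real BDGS preserves far richer characteristics (graph topology, table schemas, image statistics); this sketch only shows the scale-from-seed principle.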


(2) TPCx-BB (BigBench)

The TPCx-BB (Transaction Processing Performance Council's BigBench) benchmark is a standardized performance test designed by the Transaction Processing Performance Council (TPC) to measure the capabilities of big data systems. It consists of a suite of 30 queries and workloads that simulate real-world big data processing tasks, such as data ingestion, transformation, and analysis, in scenarios including customer behavior analysis, social network analysis, and text processing. The benchmark uses a comprehensive dataset, including structured and semi-structured data, to test a system's ability to handle various data types and formats, and assesses the performance, scalability, and price-performance of big data systems.


(3) HiBench

HiBench is a benchmark suite developed by Intel to evaluate the performance of big data frameworks. It is designed to provide a comprehensive evaluation of the system's performance across various workloads commonly found in big data applications. The HiBench benchmark suite includes a set of micro and macro benchmarks that test the performance of the system in areas such as data generation, data sorting, machine learning, graph processing, and web search. The benchmarks are designed to simulate real-world workloads and provide a standardized way to compare the performance of different big data systems.


(4) CloudSuite

CloudSuite is a benchmark suite for cloud services. The fourth release consists of eight first-party applications that have been selected based on their popularity in today's datacenters. The benchmarks are based on real-world software stacks and represent real-world setups.



(5) CALDA

CALDA is a benchmarking effort targeting MapReduce systems and parallel DBMSes. Its workloads come from the original MapReduce paper [34], plus four additional complex analytical tasks.


(6) YCSB

YCSB, released by Yahoo!, is a benchmark for data storage systems that includes only online service workloads, i.e., Cloud OLTP. The workloads are mixes of read/write operations chosen to cover a wide performance space.
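Those read/write mixes can be sketched as a simple operation-stream generator (hypothetical code, not the YCSB client); YCSB's Workload B, for example, is 95% reads and 5% updates. Keys are drawn uniformly here, whereas YCSB typically uses a Zipfian distribution:

```python
import random

def ycsb_style_ops(n_ops, read_proportion=0.95, keyspace=1000, seed=7):
    """Generate (operation, key) pairs mimicking a YCSB-style mix,
    e.g. 95% reads / 5% updates as in YCSB's Workload B."""
    rng = random.Random(seed)
    for _ in range(n_ops):
        op = "READ" if rng.random() < read_proportion else "UPDATE"
        yield op, "user%d" % rng.randrange(keyspace)

# Materialize a 10,000-operation stream and check the achieved mix.
ops = list(ycsb_style_ops(10_000))
read_ratio = sum(op == "READ" for op, _ in ops) / len(ops)
```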


(7) AMP Benchmarks

The AMP benchmark is a big data benchmark proposed by UC Berkeley that focuses on real-time analytic applications. It measures response time on a handful of relational queries (scans, aggregations, joins, and UDFs) across different data sizes.


(8) SPEC Cloud® IaaS 2018

The SPEC Cloud® IaaS 2018 benchmark addresses the performance of infrastructure-as-a-service (IaaS) cloud platforms. IaaS cloud platforms can either be public or private.



(1) AIBench Training

AIBench Training adopts a balanced AI benchmarking methodology considering comprehensiveness, representativeness, affordability, and portability. The methodology widely investigates AI tasks and models and covers the algorithm-level, system-level, and microarchitectural-level factor spaces to the largest extent possible. At the algorithm level, the commonly used building blocks, model layers, loss functions, optimizers, FLOPs, and parameter sizes of different scales are considered; at the system level, the convergence rate and hot functions are considered; at the microarchitectural level, the diverse computation and memory access patterns are considered. AIBench Training covers nineteen representative AI tasks with state-of-the-art models to guarantee diversity and representativeness. In addition, two AIBench Training subsets, RPR and WC, are provided to achieve affordability.


(2) AIBench Inference

Through a thorough analysis of the core scenarios in three primary Internet services (search engine, social network, and e-commerce), AIBench Inference provides nineteen workloads, each of which represents a representative AI task.


(3) ScenarioBench

Instead of using real-world applications or implementing a full-fledged application from scratch, ScenarioBench proposes permutations of essential tasks as scenario benchmarks. The goal is to identify the critical paths and primary modules of a real-world scenario, since they consume the most system resources and are the core focus of system design and optimization. Each scenario benchmark distills the crucial attributes of an industry-scale application and reduces the side effects of the latter's complexity in terms of huge code size, extreme deployment scale, and complex execution paths.


(4) AI Matrix

AI Matrix results from a full investigation of the DL applications used inside Alibaba and aims to cover the typical DL applications that account for more than 90% of the GPU usage in Alibaba data centers. The collected benchmarks fall mainly into three categories: computer vision, recommendation, and language processing, which together constitute the vast majority of DL applications in Alibaba.


(5) Dcbench

Dcbench aims to provide a standardized way to evaluate the tools and systems for data-centric AI development.


(6) DAWNBench

DAWNBench is a benchmark and competition focused on end-to-end training time to achieve a state-of-the-art accuracy level, as well as inference time with that accuracy.


(7) Fathom

Fathom is a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group.


(8) MLPerf Training Benchmark

MLPerf training benchmark measures how fast systems can train models to a target quality metric. It contains 8 workloads, each of which is defined by a Dataset and Quality Target.


(9) MLPerf Inference Benchmark

MLPerf inference benchmark presents the benchmarking method for evaluating ML inference systems, and prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures.


(10) TPCx-AI

TPCx-AI is an end-to-end AI benchmark standard developed by the TPC. It measures the performance of an end-to-end machine learning or data science platform. Benchmark development has focused on emulating the behavior of representative industry AI solutions that are relevant in current production datacenters and cloud environments.



(1) HPC AI500 V3.0

HPC AI500 V3.0 is a scalable and customizable framework for HPC AI benchmarking. Its methodology allows users to integrate existing AI benchmarks in a bagging manner, a meta-algorithm of ensemble learning with intrinsically high parallelism, leading to scalable benchmarking. The bagging management and model-parallelism management of HPC AI500 V3.0 give users the flexibility to control the size of model ensembles and the degree of model parallelism, enabling various optimizations at both the system and algorithm levels. Building on HPC AI500 V2.0, which tackles the equivalence, representativeness, affordability, and repeatability issues, HPC AI500 V3.0 provides a complete HPC AI benchmarking framework.


(2) HPL-MxP

The HPL-MxP benchmark seeks to highlight the emerging convergence of high-performance computing (HPC) and artificial intelligence (AI) workloads. The innovation of HPL-MxP lies in dropping the requirement of 64-bit computation throughout the entire solution process and instead opting for low-precision (likely 16-bit) accuracy for LU, and a sophisticated iteration to recover the accuracy lost in factorization.
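The idea can be illustrated with a toy mixed-precision solver (a sketch, not the HPL-MxP code): factor and solve entirely in half precision, then recover accuracy by iterative refinement against the full-precision residual.

```python
import struct

def to_fp16(x):
    # Round a Python float to IEEE half precision (the "low precision" step).
    return struct.unpack('e', struct.pack('e', x))[0]

def lu_solve_fp16(A, b):
    """Gaussian elimination with every arithmetic result rounded to fp16."""
    n = len(A)
    M = [[to_fp16(v) for v in row] + [to_fp16(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            f = to_fp16(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = to_fp16(M[i][j] - to_fp16(f * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = to_fp16(s / M[i][i])
    return x

def refine(A, b, x, iters=20):
    """Iterative refinement: the residual is computed in fp64, the cheap
    correction solve stays in fp16, and the update accumulates in fp64."""
    n = len(A)
    for _ in range(n and iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve_fp16(A, r)
        x = [x[i] + d[i] for i in range(n)]
    return x

# 2x2 system with exact solution x = (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = refine(A, b, lu_solve_fp16(A, b))
```

HPL-MxP does this at scale with blocked LU and GMRES-based refinement; the sketch only shows why a low-precision factorization can still yield a high-accuracy solution.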


(3) MLPerf HPC

MLPerf HPC is a benchmark suite of large-scale scientific machine learning training applications, driven by the MLCommons Association.


(4) AIPerf

AIPerf is an end-to-end benchmark suite utilizing automated machine learning (AutoML). It represents real AI scenarios, and scales auto-adaptively to various scales of machines.


AI for science

(1) SAIBench

Scientific research communities are embracing AI-based solutions to target tractable scientific tasks and improve research workflows. However, the development and evaluation of such solutions are scattered across multiple disciplines. SAIBench formalizes the problem of scientific AI benchmarking and tries to unify these efforts and enable low-friction onboarding of new disciplines. SAIBench uses a domain-specific language to decouple research problems, AI models, ranking criteria, and software/hardware configurations into reusable modules.



Spiking Neural Networks (SNN) in AI

(1) SNNBench

SNNBench is the first end-to-end SNN benchmark that covers both the training and inference phases, training each model to a target accuracy. It takes into account various aspects, including domains, training paradigms, learning rules, spiking neurons, and connection types. It encompasses image classification and speech recognition workloads and compares different learning rules with respect to training/inference speed, training stability, and accuracy. Additionally, it provides a detailed characterization of the operator percentages in SNNs and assesses their scalability.


(2) Benchmark from SNABSuite

This benchmark focuses on the inference phase on different hardware backends, specifically neuromorphic hardware. It does not include a training phase and contains only an image classification workload, though with different network architectures. Additionally, it provides a sweep strategy to search for the best configuration on resource-restricted hardware such as low-memory devices.


(3) Benchmark from Kulkarni et al.

This benchmark consists of machine learning workloads encompassing different learning rules, including backpropagation, reservoir, and evolutionary. However, it only mimics part of the training process rather than conducting a full training session, and therefore reports no accuracy information. The project is not open-source as of this writing.


Spiking Neural Networks (SNN) in computational neuroscience

(1) Simulation of networks of spiking neurons: A review of tools and strategies

This project provides a detailed review of simulation tools for spiking neural models and proposes a benchmark suite containing four workloads and different neural models, including the leaky integrate-and-fire model and the more complex Hodgkin–Huxley (HH) model.
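The leaky integrate-and-fire dynamics that these workloads build on are simple enough to sketch directly (illustrative parameters and forward-Euler integration, not the benchmark's own code):

```python
def simulate_lif(input_current, dt=1e-4, tau=0.02,
                 v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Forward-Euler integration of a leaky integrate-and-fire neuron:
    dv/dt = (-(v - v_rest) + I) / tau, with spike-and-reset at threshold.
    Returns the list of spike times (seconds)."""
    v, spikes = v_rest, []
    for step, i_in in enumerate(input_current):
        v += dt * (-(v - v_rest) + i_in) / tau
        if v >= v_thresh:
            spikes.append(step * dt)
            v = v_reset
    return spikes

# A constant supra-threshold input (I = 1.5 > v_thresh) drives
# periodic spiking over a 0.2 s simulation.
spikes = simulate_lif([1.5] * 2000)
```

The Hodgkin–Huxley model replaces the single leak term with voltage-gated sodium and potassium conductances, which is what makes it so much more expensive to simulate.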

(2) Software for Brain network simulations: a comparative study

This project proposes two benchmarks: a classical Pyramidal InterNeuron Gamma (PING) network of leaky integrate-and-fire neurons and a Postinhibitory Rebound InterNeuron Gamma (PIR-ING) network of Hodgkin–Huxley neurons. It provides implementations for the Brian, NEURON, and NEST simulators.


Edge, IoT and mobile

(1) Edge AIBench V3.0

Edge AIBench V3.0 is a scenario benchmark for IoT-Edge-Cloud systems, which proposes a set of distilling rules for replicating autonomous vehicle scenarios to extract critical tasks with intertwined interactions. The essential system-level and component-level characteristics are captured while the system complexity is reduced significantly so that users can quickly evaluate and pinpoint the system and component bottlenecks. Also, Edge AIBench V3.0 implements a scalable architecture through which users can assess the systems with different sizes of workloads.


(2) AIoTBench

AIoTBench focuses on evaluating the inference ability of mobile and embedded devices. Considering the representativeness and diversity of both models and frameworks, AIoTBench covers three typical heavy-weight networks (ResNet50, InceptionV3, DenseNet121) as well as three light-weight networks (SqueezeNet, MobileNetV2, MnasNet). Each model is implemented in three popular frameworks: TensorFlow Lite, Caffe2, and PyTorch Mobile. For each model in TensorFlow Lite, three quantization versions are also provided: dynamic range quantization, full integer quantization, and float16 quantization.


(3) IoTBench

IoTBench is a data-centric and configurable IoT benchmark suite. It covers three types of algorithms commonly used in IoT applications: matrix processing, list operations, and convolution.


(4) Flet-Edge

Flet-Edge is a full life-cycle evaluation tool for deep learning frameworks on the edge. To describe the full life-cycle performance of frameworks on the edge, a comprehensive metric set, PDR, is proposed, comprising three sub-metrics: Programming complexity, Deployment complexity, and Runtime performance. Flet-Edge automatically collects the PDR metrics and presents them visually.


(5) UL Procyon AI Inference Benchmark

The UL Procyon AI Inference Benchmark measures the AI performance of Android devices using NNAPI. The benchmark score reflects both the speed and the accuracy of on-device inferencing operations. With the Procyon AI Inference Benchmark, not only can you measure the performance of dedicated AI processing hardware in Android devices, you can also verify NNAPI implementation quality.


(6) MLMark

The EEMBC MLMark® benchmark is a machine-learning (ML) benchmark designed to measure the performance and accuracy of embedded inference. The motivation for developing this benchmark grew from the lack of standardization of the environment required for analyzing ML performance. MLMark is targeted at embedded developers and attempts to clarify the environment in order to facilitate not just performance analysis of today's offerings, but also tracking trends over time to improve new ML architectures.


(7) AI Benchmark

AI Benchmark runs several key AI tasks on a phone and professionally evaluates its performance.


(8) MLPerf Tiny Benchmark

MLPerf Tiny measures the accuracy, latency, and energy of machine learning inference in order to properly evaluate the tradeoffs among ultra-low-power tiny machine learning (TinyML) systems.


(9) MLPerf Mobile Inference Benchmark

MLPerf mobile inference benchmark comes as a mobile app for different computer vision and natural language processing tasks. The benchmark also supports non-smartphone devices, such as laptops and mobile PCs.


(10) EDLAB

EDLAB is a benchmarking evaluation tool that automatically evaluates different edge deep learning platforms.


NLP and Big Language Models

(1) AGIBench

AGIBench is a multi-granularity, multimodal, human-referenced, and auto-scoring benchmark for LLMs. Instead of a collection of blended questions, AGIBench focuses on three typical ability branches and adopts a four-tuple of attributes to label each question. First, it supports multi-granularity benchmarking, e.g., per-question, per-ability-branch, per-knowledge, per-modal, per-dataset, and per-difficulty-level granularities. Second, it contains multimodal input, including text and images. Third, it classifies all the questions into five degrees of difficulty according to the average accuracy rate of a large pool of educated humans (human-referenced). Fourth, it adopts zero-shot learning to avoid introducing additional unpredictability and provides an auto-scoring method to extract and judge the results. Finally, it defines multi-dimensional metrics, including accuracy under the average, worst, best, and majority-voting cases, as well as repeatability.
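The multi-run accuracy metrics can be sketched as follows (a hypothetical illustration of average/worst/best/majority-voting scoring, not AGIBench's own implementation):

```python
from collections import Counter

def accuracy_metrics(runs, gold):
    """Accuracy under the average, worst, best, and majority-voting cases
    across repeated runs. `runs` is a list of per-run answer lists; `gold`
    is the answer key, aligned by question index."""
    def acc(answers):
        return sum(a == g for a, g in zip(answers, gold)) / len(gold)
    per_run = [acc(r) for r in runs]
    # Majority vote per question across runs (ties break arbitrarily).
    majority = [Counter(col).most_common(1)[0][0] for col in zip(*runs)]
    return {"average": sum(per_run) / len(per_run),
            "worst": min(per_run),
            "best": max(per_run),
            "majority": acc(majority)}

# Three repeated runs over a four-question multiple-choice set.
runs = [["A", "B", "C", "D"],
        ["A", "B", "D", "D"],
        ["A", "C", "C", "A"]]
m = accuracy_metrics(runs, ["A", "B", "C", "D"])
```

In this toy case majority voting recovers the correct answer on every question even though two of the three runs make mistakes, which is exactly the effect the multi-case metrics are meant to expose.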


(2) BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities.


(3) HELM

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. Holistic Evaluation of Language Models (HELM) has two levels: (i) an abstract taxonomy of scenarios and metrics to define the design space for language model evaluation and (ii) a concrete set of implemented scenarios and metrics that were selected to prioritize coverage (e.g. different English varieties), value (e.g. user-facing applications), and feasibility (e.g. limited engineering resources).


(4) SuperGLUE

New models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark offered a single-number metric summarizing progress on a diverse set of such tasks, but performance on it has come close to the level of non-expert humans, suggesting limited headroom for further research. SuperGLUE is a successor benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a public leaderboard.


(5) EleutherAI LM Evaluation Harness

The EleutherAI LM Evaluation Harness provides a unified framework for testing autoregressive language models (GPT-2, GPT-3, GPT-Neo, etc.) on a large number of different evaluation tasks. It implements 200+ tasks and supports GPT-2, GPT-3, GPT-Neo, GPT-NeoX, and GPT-J, with a flexible, tokenization-agnostic interface.



(1) Linpack

The Linpack Benchmark measures a computer's floating-point rate of execution, determined by running a program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit; in fact, there are three benchmarks included in the Linpack Benchmark report.
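For the HPL variant used in TOP500 rankings, the reported rate follows from the fixed operation count credited to the dense solve, 2/3·n³ + 2·n², divided by the wall-clock time:

```python
def hpl_gflops(n, seconds):
    """HPL performance in Gflop/s: the dense LU solve of an n-by-n system
    is credited with 2/3*n^3 + 2*n^2 floating-point operations regardless
    of the algorithm actually used."""
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / seconds / 1e9

# Illustrative (made-up) run: a 100,000-equation solve finishing in 1 hour.
rate = hpl_gflops(n=100_000, seconds=3600)
```

Fixing the operation count (rather than counting the operations a clever implementation actually performs) is what keeps results comparable across systems.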


(2) HPCC

HPC Challenge is a benchmark suite that measures a range of memory access patterns. It consists of seven tests: HPL, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and communication bandwidth and latency.


(3) SPEChpc 2021

HPC systems are being built with an increasing level of heterogeneity. The numerous types of accelerators bring tremendous extra computing power while at the same time introducing big challenges in performance evaluation and characterization. The SPEChpc 2021 Benchmark Suites address these challenges by providing a set of application benchmark suites that offer a comprehensive measure of real-world performance for state-of-the-art HPC systems. They offer well-selected science and engineering codes that are representative of HPC workloads and portable across CPUs and accelerators, along with fair comparative performance metrics.


(4) NAS Parallel Benchmark

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications in the original "pencil-and-paper" specification.


CPU and Accelerators

(1) WPC

WPC is a whole-picture workload characterization methodology and tool that integrates microarchitecture-dependent, microarchitecture-independent, and ISA-independent characterization methodologies. It performs whole-picture analysis of hierarchical profile data across the Intermediate Representation (IR), ISA, and microarchitecture levels to sum up the inherent workload characteristics and understand the reasons behind the numbers.


(2) CPUBench



(3) SPEC CPU 2017

The SPEC CPU® 2017 benchmark package contains SPEC's next-generation, industry-standardized, CPU-intensive suites for measuring and comparing compute-intensive performance, stressing a system's processor, memory subsystem, and compiler.



(4) PARSEC

The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors.


(5) iMLBench

iMLBench is a machine learning benchmark suite targeting CPU-GPU integrated architectures. It provides machine learning workloads including Linear Regression (LR), K-means (KM), K Nearest Neighbor (KNN), Back Propagation (BP), 2D Convolution Neural Network (2DCNN), 3D Convolution Neural Network (3DCNN), Multi-layer Perceptron (MLP), and Winograd Convolution (Winograd).


(6) DeepBench

The primary purpose of DeepBench is to benchmark operations that are important to deep learning on different hardware platforms. DeepBench includes operations and workloads that are important to both training and inference.



(1) DPUBench

DPUBench is an application-driven, scalable benchmark suite. It classifies DPU applications into three typical scenarios (network, storage, and security) and includes a scalable benchmark framework containing an essential operator set for these scenarios and end-to-end evaluation programs for real data center scenarios.



(1) OLxPBench

OLxPBench is a composite Hybrid Transactional/Analytical Processing (HTAP) benchmark suite that emphasizes the necessity of real-time queries, semantically consistent schemas, and domain-specific workloads in benchmarking, designing, and implementing HTAP systems. OLxPBench proposes: (1) the abstraction of a hybrid transaction, performing a real-time query in-between an online transaction, to model a widely observed behavior pattern: making a quick decision while consulting real-time analysis; (2) a semantically consistent schema to express the relationships between OLTP and OLAP schemas; (3) the combination of domain-specific and general benchmarks to characterize diverse application scenarios with varying resource demands.
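The hybrid-transaction abstraction can be sketched with SQLite (a toy schema and decision rule, not OLxPBench's workloads): an analytical query runs in-between the statements of a single online transaction, so the decision sees fresh data.

```python
import sqlite3

# Toy schema (hypothetical): one orders table shared by the
# transactional and analytical sides.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT,"
            " amount REAL, flag TEXT)")
con.executemany("INSERT INTO orders (item, amount) VALUES (?, ?)",
                [("book", 12.0), ("pen", 3.0), ("book", 15.0)])
con.commit()

def hybrid_transaction(con, item, amount):
    """A hybrid transaction: a real-time analytical query executed
    in-between the statements of one online transaction."""
    with con:  # a single atomic transaction
        cur = con.execute("INSERT INTO orders (item, amount) VALUES (?, ?)",
                          (item, amount))
        # Analytical query over data that includes the row just inserted.
        avg = con.execute("SELECT AVG(amount) FROM orders WHERE item = ?",
                          (item,)).fetchone()[0]
        flag = "HIGH" if amount > avg else "NORMAL"  # decision on fresh data
        con.execute("UPDATE orders SET flag = ? WHERE id = ?",
                    (flag, cur.lastrowid))
    return flag

# A $40 book order is flagged HIGH against the real-time average (~22.3).
flag = hybrid_transaction(con, "book", 40.0)
```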


(2) mOLxPBench

HTAP databases severely lack micro-benchmarks; no open-source HTAP micro-benchmark exists. mOLxPBench is a micro-benchmark for HTAP databases that can precisely control the rate of fresh-data generation and the granularity of fresh-data access. This is the distinguishing hallmark setting micro-benchmarks apart from conventional HTAP benchmarks. It is important to underscore that an effective evaluation of HTAP databases necessitates the integration of both macro- and micro-benchmarks.


(3) TPC-C

TPC Benchmark C is an on-line transaction processing (OLTP) benchmark. TPC-C is more complex than previous OLTP benchmarks such as TPC-A because of its multiple transaction types, more complex database, and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity, either executed on-line or queued for deferred execution. The database comprises nine types of tables with a wide range of record and population sizes. TPC-C performance is measured in transactions per minute (tpmC).


(4) TPC-E

TPC Benchmark E is an on-line transaction processing (OLTP) benchmark. TPC-E is more complex than previous OLTP benchmarks such as TPC-C because of its diverse transaction types, more complex database, and overall execution structure. TPC-E involves a mix of twelve concurrent transactions of different types and complexity, either executed on-line or triggered by price or time criteria. The database comprises thirty-three tables with a wide range of columns, cardinality, and scaling properties. TPC-E performance is measured in transactions per second (tpsE).


(5) TPC-H

TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. The benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. The performance metric reported by TPC-H is the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), which reflects multiple aspects of a system's capability to process queries.
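The composite metric combines the single-stream Power metric and the multi-stream Throughput metric at a given scale factor as a geometric mean, which can be computed as:

```python
import math

def qphh(power_at_size, throughput_at_size):
    """TPC-H composite metric QphH@Size: the geometric mean of the
    Power@Size and Throughput@Size metrics at the same scale factor."""
    return math.sqrt(power_at_size * throughput_at_size)

# Illustrative (made-up) values for a single run at one scale factor.
composite = qphh(power_at_size=120_000.0, throughput_at_size=90_000.0)
```

The geometric mean penalizes systems that excel at only one of the two modes, so a competitive QphH@Size requires balanced single-stream and multi-stream performance.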


(6) TPC-DS

TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general-purpose decision support system. A benchmark result measures query response time in single-user mode, query throughput in multi-user mode, and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload.


(7) LinkBench

LinkBench is a database benchmark developed to evaluate database performance for workloads similar to those of Facebook's production MySQL deployment. It can be reconfigured to simulate a variety of workloads and plugins can be written for benchmarking additional database systems.


Power Systems

(1) PowerSystemBench