News: BigDataBench 4.0 will be released on March 1, 2018. BigDataBench 4.0, BigDataDwarf, and BOPS technical reports. BigDataBench tutorial at ASPLOS'18 (an open-source big data and AI benchmark suite). BPOE-9 call for papers (due February 13, 2018; co-located with ASPLOS 2018). TR on a dwarf-based scalable big data benchmarking methodology. BigData100 ranking.
Two recent BigDataBench slide decks [BigDataBench-WBDB2015] [BigDataBench-HPBDC2015]. TR on eight dwarf workloads in big data analytics. China's first industry-standard big data benchmark suite. BigDataBench tutorial at MICRO 2014. BigDataBench on MARSSx86, gem5, and Simics.
Summary
BigDataBench is an open-source big data benchmark suite, the product of a multidisciplinary research and engineering effort, spanning systems, architecture, and data management, from both industry and academia. By nature, BigDataBench is a benchmark suite for scale-out workloads, unlike SPEC CPU (sequential workloads) and PARSEC (multithreaded workloads). The current BigDataBench 3.2 models five typical and important big data application domains: search engine, social networks, e-commerce, multimedia analytics, and bioinformatics. In total, it includes 15 real-world data sets and 34 big data workloads.
In specifying representative big data workloads, BigDataBench focuses on units of computation that appear frequently in OLTP, NoSQL, OLAP, interactive and offline analytics, graph computing, and streaming computing in each application domain. It identifies eight dwarf workloads in big data analytics (please refer to our technical reports [PDF1, PDF2]). Meanwhile, it considers a variety of data models with different types and semantics extracted from real-world data sets, including unstructured, semi-structured, and structured data. BigDataBench also provides an end-to-end application benchmarking framework (please refer to our DASFAA paper) that allows flexible benchmarking scenarios to be created by abstracting data operations and workload patterns; the framework can be extended to other application domains.
For the same big data benchmark specifications, different implementations are provided. For example, we and other developers have implemented the offline analytics workloads using MapReduce, MPI, Spark, and Flink, and the interactive analytics and OLAP workloads using Shark, Impala, and Hive. In addition to real-world data sets, BigDataBench provides several parallel big data generation tools, collectively called BDGS, that generate scalable big data, e.g., at PB scale, from small- or medium-scale real-world data while preserving its original characteristics.
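To make "one specification, many implementations" concrete, here is a minimal, hypothetical sketch (not BigDataBench code) of the WordCount specification expressed in MapReduce style in plain Python; the Hadoop, Spark, Flink, and MPI versions in the suite realize the same map/reduce contract on distributed engines:

```python
from collections import Counter

def map_phase(line):
    """Map: emit (word, 1) pairs, as a MapReduce mapper would."""
    return [(w, 1) for w in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts per word, as a MapReduce reducer would."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def wordcount(lines):
    # The shuffle is implicit here: all pairs land in a single reducer.
    all_pairs = [p for line in lines for p in map_phase(line)]
    return reduce_phase(all_pairs)

print(wordcount(["big data big", "data benchmark"]))
# {'big': 2, 'data': 2, 'benchmark': 1}
```

The specification fixes only the map and reduce contracts; a Spark or Flink implementation swaps the in-memory lists for distributed datasets without changing the workload's semantics.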
To model and reproduce multi-application or multi-user scenarios on Cloud or datacenters, we provide the multi-tenancy version of BigDataBench, which allows flexible setting and replaying of mixed workloads according to the real workload traces—the Facebook, Google and Sogou traces.
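The core of such a trace-driven driver can be sketched as follows; the trace format (timestamp, job-name pairs) and the `submit` callback are illustrative assumptions, not the actual multi-tenancy harness:

```python
def replay_trace(trace, submit):
    """Replay (timestamp, job) records in time order, invoking `submit`
    for each job, the way a multi-tenancy driver mixes workloads.
    A real driver would sleep until each timestamp before submitting."""
    for ts, job in sorted(trace):
        submit(ts, job)

# Hypothetical trace records: (arrival time in seconds, workload name).
trace = [(3.0, "Sort"), (0.5, "PageRank"), (1.2, "Grep")]
launched = []
replay_trace(trace, lambda ts, job: launched.append(job))
print(launched)  # ['PageRank', 'Grep', 'Sort']
```

Replacing the hypothetical list with records parsed from the Facebook, Google, or Sogou traces yields a realistic mixed-tenant arrival process.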
For system and architecture research, e.g., architecture, OS, networking, and storage, the number of benchmarks is multiplied by the different implementations and hence becomes massive. To reduce research and benchmarking cost, we select a small number of representative benchmarks, called the BigDataBench subset, according to workload characteristics from a specific perspective. For example, since simulation-based research is very time-consuming, we provide the BigDataBench architecture subset in MARSSx86, gem5, and Simics simulator versions, respectively.
Together with several industry partners, including Telecom Research Institute Technology, Huawei, Intel (China), Microsoft (China), IBM CDL, Baidu, Sina, INSPUR, and ZTE, we have also released China's first industry-standard big data benchmark suite, BigDataBench-DCA, which is a subset of BigDataBench.
Why BigDataBench?
As shown in Table 1, across nine desired properties, BigDataBench is more comprehensive than the other state-of-the-art big data benchmark suites.
Table 1: The differences between BigDataBench and other benchmark suites.
| | Spec[1] | App domains | Workload types | Workloads | Scalable data sets[2] | Diverse implem.[3] | Multi-tenancy[4] | Subset[5] | Simulator[6] |
|---|---|---|---|---|---|---|---|---|---|
| BigDataBench | Y | Five | Six[7] | Thirty-three[8] | Eight[9] | Y | Y | Y | Y |
| BigBench | Y | One | Three | Ten | Three | N | N | N | N |
| CloudSuite | N | N/A | Two | Eight | Three | N | N | N | Y |
| HiBench | N | N/A | Two | Ten | Three | N | N | N | N |
| CALDA | Y | N/A | One | Five | N/A | Y | N | N | N |
| YCSB | Y | N/A | One | Six | N/A | Y | N | N | N |
| LinkBench | Y | N/A | One | Ten | N/A | Y | N | N | N |
| AMP Benchmarks | Y | N/A | One | Four | N/A | Y | N | N | N |
[1] Spec is short for specification. Y indicates there is a specification; N indicates there is none.
[2] Scalable data sets are extracted from real-world data sets. The number x indicates that x scalable data sets are extracted from real-world data sets.
[3] For diverse implementations, Y indicates that, for the same workload specification, diverse implementations using competitive techniques are provided; N indicates that only a few implementations are provided.
[4] For multi-tenancy, Y indicates that a multi-tenancy version is provided; N indicates that it is not.
[5] For subset, Y indicates that a subset of the benchmarks is provided (for example, an architecture subset is provided for BigDataBench); N indicates that there is no subset.
[6] For simulator, Y indicates that a simulator version is provided (e.g., MARSSx86, gem5, and Simics versions are provided for BigDataBench); N indicates that no simulator version is provided.
[7] The six workload types are Streaming, Offline Analytics, Cloud OLTP, DW (data warehouse), Graph, and Online Service.
[8] There are 42 workloads in the specification, of which we have implemented 34.
[9] Eight real data sets are scalable; the other seven are under development.
What’s New?
BigDataBench 3.2 adds graph and streaming frameworks and provides Flink implementations. Currently, BigDataBench includes 15 real-world data sets and 34 big data workloads. We also release a multi-tenancy version for multi-user or multi-application scenarios and simulator versions (MARSSx86, gem5, and Simics) for the architecture community.
Methodology
Figure 1 summarizes the benchmarking methodology of BigDataBench. Overall, it includes five steps: investigating and choosing important application domains; identifying typical workloads and data sets; proposing big data benchmark specifications; providing diverse implementations using competitive techniques; and mixing different workloads to assemble multi-tenancy workloads or subsetting the big data benchmarks.
Figure 1 BigDataBench Benchmarking Methodology.
Benchmarks
BigDataBench is expanding and evolving rapidly. Currently, we have proposed benchmark specifications modeling five typical application domains. The current version, BigDataBench 3.2, includes 15 real-world data sets and 34 big data workloads. Table 2 summarizes the real-world data sets and scalable data generation tools included in BigDataBench 3.2, covering the whole spectrum of data types (structured, semi-structured, and unstructured) and data sources (text, graph, image, audio, video, and table data). Table 3 presents BigDataBench from the perspectives of application domains, workloads, workload types, data sets, and software stacks. Some end users may only care about big data applications of a specific type. For example, to perform an apples-to-apples comparison of software stacks for offline analytics, they only need to choose the benchmarks of the offline analytics type. However, users who want to measure or compare big data systems and architectures should cover all benchmarks.
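Selecting the benchmarks of a single type can be as simple as filtering the catalogue; the entries below are a hypothetical, abbreviated mirror of Table 3, not an actual BigDataBench API:

```python
# Hypothetical catalogue entries mirroring Table 3's columns.
BENCHMARKS = [
    {"op": "Grep", "type": "Offline analytics",
     "stacks": ["Hadoop", "Spark", "Flink", "MPI"]},
    {"op": "Select Query", "type": "Data Warehouse",
     "stacks": ["Hive", "Shark", "Impala"]},
    {"op": "Read", "type": "Cloud OLTP",
     "stacks": ["HBase", "MySQL"]},
]

def by_type(workload_type):
    """Keep only the benchmarks of one type, e.g. to compare
    offline-analytics software stacks in isolation."""
    return [b["op"] for b in BENCHMARKS if b["type"] == workload_type]

print(by_type("Offline analytics"))  # ['Grep']
```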
Table 2. The summary of data sets and data generation tools

| # | Data sets | Data size | Scalable data set |
|---|---|---|---|
| 1 | Wikipedia Entries | 4,300,000 English articles (unstructured text) | Text Generator of BDGS |
| 2 | Amazon Movie Reviews | 7,911,684 reviews (semi-structured text) | Text Generator of BDGS |
| 3 | Google Web Graph | 875,713 nodes, 5,105,039 edges (unstructured graph) | Graph Generator of BDGS |
| 4 | Facebook Social Network | 4,039 nodes, 88,234 edges (unstructured graph) | Graph Generator of BDGS |
| 5 | E-commerce Transaction Data | table 1: 4 columns, 38,658 rows; table 2: 6 columns, 242,735 rows (structured table) | Table Generator of BDGS |
| 6 | ProfSearch Person Resumes | 278,956 resumes (semi-structured table) | Table Generator of BDGS |
| 7 | ImageNet | ILSVRC2014 DET image data set (unstructured image) | Ongoing development |
| 8 | English broadcasting audio files | Sampled at 16 kHz, 16-bit linear sampling (unstructured audio) | Ongoing development |
| 9 | DVD Input Streams | 110 input streams, resolution 704*480 (unstructured video) | Ongoing development |
| 10 | Image scene | 39 image scene description files (unstructured text) | Ongoing development |
| 11 | Genome sequence data | cfa data format (unstructured text) | 4 volumes of data sets |
| 12 | Assembly of the human genome | fa data format (unstructured text) | 4 volumes of data sets |
| 13 | SoGou Data | Corpus and search query data from SoGou Labs (unstructured text) | Ongoing development |
| 14 | MNIST | Handwritten digits database with 60,000 training examples and 10,000 test examples (unstructured image) | Ongoing development |
| 15 | MovieLens Dataset | Users' movie ratings: 9,518,231 training examples and 386,835 test examples (semi-structured text) | Ongoing development |
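A text generator in the spirit of BDGS can be sketched as follows: fit a deliberately crude unigram model to a small seed corpus, then sample arbitrarily large synthetic text whose word frequencies track the seed's. BDGS itself uses richer models; this is only an illustration of the "scale up while preserving characteristics" idea:

```python
import random
from collections import Counter

def fit_unigram(seed_corpus):
    """Learn the seed corpus's word distribution (a crude stand-in
    for BDGS's model-based text generation)."""
    counts = Counter(w for line in seed_corpus for w in line.split())
    total = sum(counts.values())
    words = list(counts)
    weights = [counts[w] / total for w in words]
    return words, weights

def generate(words, weights, n_words, seed=0):
    """Sample n_words tokens from the learned distribution."""
    rnd = random.Random(seed)
    return rnd.choices(words, weights=weights, k=n_words)

seed_corpus = ["big data big benchmark", "data data suite"]
words, weights = fit_unigram(seed_corpus)
synthetic = generate(words, weights, 10_000)
# The synthetic text's word frequencies track the seed's:
print(Counter(synthetic).most_common(1)[0][0])  # 'data' (most frequent in seed)
```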
Table 3. The summary of the implemented workloads in BigDataBench 3.2

| Domains | Operations or algorithms | Types | Data sets | Software stacks | Spec ID |
|---|---|---|---|---|---|
| Search Engine | Grep | Offline analytics | Wikipedia Entries | Hadoop, Spark, Flink, MPI | W1-1 |
| Search Engine | Grep | Streaming | Random Generate | Spark Streaming | W1-1 |
| Search Engine | WordCount | Offline analytics | Wikipedia Entries | Hadoop, Spark, Flink, MPI | W1-2 |
| Search Engine | Index | Offline analytics | Wikipedia Entries | Hadoop, Spark, MPI | W1-4 |
| Search Engine | PageRank | Offline analytics | Google Web Graph | Hadoop, Spark, Flink, MPI | W1-5 |
| Search Engine | Nutch Server | Online Service | SoGou Data | Nutch | W1-6-1 |
| Search Engine | Search | Streaming | Search Data | JStorm | W1-6-2 |
| Search Engine | Sort | Offline analytics | Wikipedia Entries | Hadoop, Spark, MPI | W1-7 |
| Search Engine | Read | Cloud OLTP | ProfSearch Resumes | HBase, MySQL | W1-11-1 |
| Search Engine | Write | Cloud OLTP | ProfSearch Resumes | HBase, MySQL | W1-11-2 |
| Search Engine | Scan | Cloud OLTP | ProfSearch Resumes | HBase, MySQL | W1-11-3 |
| Social Networks | Rolling Top Words | Streaming | Random Generate | JStorm, Spark Streaming | W2-1 |
| Social Networks | CC | Graph | Facebook Social Network | Hadoop, Spark, MPI, GraphX, GraphLab, Flink Gelly | W2-8-1 |
| Social Networks | Kmeans | Streaming | Random Generate | Spark Streaming | W2-8-2 |
| Social Networks | Kmeans | Offline analytics | Facebook Social Network | Hadoop, Spark, Flink, MPI | W2-8-2 |
| Social Networks | Label Propagation | Graph | Facebook Social Network | GraphX, GraphLab, Flink Gelly | W2-8-3 |
| Social Networks | Triangle Count | Graph | Facebook Social Network | GraphX, GraphLab, Flink Gelly | W2-8-4 |
| Social Networks | BFS | Graph | Self-generated by the program (MPI); Facebook Social Network | GraphX, GraphLab, Flink Gelly, MPI | W2-9 |
| E-commerce | Select Query | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-1 |
| E-commerce | Aggregation | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-2 |
| E-commerce | Join Query | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-3 |
| E-commerce | CF | Streaming | MovieLens Dataset | JStorm | W3-4-1 |
| E-commerce | CF | Offline analytics | Amazon Movie Reviews | Hadoop, Spark, MPI | W3-4-2 |
| E-commerce | Bayes | Offline analytics | Amazon Movie Reviews | Hadoop, Spark, MPI | W3-5 |
| E-commerce | Project | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-1 |
| E-commerce | Filter | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-2 |
| E-commerce | Cross Product | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-3 |
| E-commerce | Order By | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-4 |
| E-commerce | Union | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-5 |
| E-commerce | Difference | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-6 |
| E-commerce | Aggregation | Data Warehouse | E-commerce Transaction Data | Hive, Shark, Impala | W3-6-7 |
| Multimedia analytics | BasicMPEG | Offline analytics | Stream Data | Libc | W4-1 |
| Multimedia analytics | SIFT | Offline analytics | ImageNet | MPI | W4-2-1 |
| Multimedia analytics | DBN | Offline analytics | MNIST | MPI | W4-2-2 |
| Multimedia analytics | Speech Recognition | Offline analytics | Audio files | MPI | W4-3 |
| Multimedia analytics | Ray Tracing | Offline analytics | Scene description files | MPI | W4-4 |
| Multimedia analytics | Image Segmentation | Offline analytics | ImageNet | MPI | W4-5 |
| Multimedia analytics | Face Detection | Offline analytics | ImageNet | MPI | W4-6 |
| Bioinformatics | | Offline analytics | Genome sequence data | Work Queue | W5-1 |
| Bioinformatics | | Offline analytics | Assembly of the human genome | MPI | W5-2 |
Evolution
As shown in Figure 2, the evolution of BigDataBench has gone through three major stages. At the first stage, we released three benchmark suites: BigDataBench 1.0 (6 workloads from search engines), DCBench 1.0 (11 workloads from data analytics), and CloudRank 1.0 (mixed data analytics workloads).
At the second stage, we merged the previous three benchmark suites and released BigDataBench 2.0, after investigating the three most important application domains of internet services in terms of page views and daily visitors. BigDataBench 2.0 includes 6 real-world data sets and 19 big data workloads with different implementations, covering six application scenarios: micro benchmarks, Cloud OLTP, relational query, search engine, social networks, and e-commerce. Moreover, BigDataBench 2.0 provides several big data generation tools (BDGS) to generate scalable big data, e.g., at PB scale, from small-scale real-world data while preserving its original characteristics.
BigDataBench 3.0 was a multidisciplinary effort. It includes 6 real-world data sets, 2 synthetic data sets, and 32 big data workloads, covering micro and application benchmarks from typical application domains, e.g., search engine, social networks, and e-commerce. To generate representative and diverse big data workloads, BigDataBench 3.0 focuses on units of computation that frequently appear in Cloud OLTP, OLAP, interactive analytics, and offline analytics.
Figure 2: BigDataBench Evolution
Previous releases
BigDataBench 3.1 http://prof.ict.ac.cn/BigDataBench/old/3.1/
BigDataBench 3.0 http://prof.ict.ac.cn/BigDataBench/old/3.0/
BigDataBench 2.0 http://prof.ict.ac.cn/BigDataBench/old/2.0/
BigDataBench 1.0 http://prof.ict.ac.cn/BigDataBench/old/1.0/
DCBench 1.0 http://prof.ict.ac.cn/DCBench/
CloudRank 1.0 http://prof.ict.ac.cn/CloudRank/
Handbook
Handbook of BigDataBench [BigDataBench-handbook]
Q&A
More questions & answers are available from the handbook of BigDataBench.
Contacts (Email)
- lijingwei@mail.bafst.com
- zhanjianfeng@ict.ac.cn
- wl@ncic.ac.cn
People
- Prof. Jianfeng Zhan, ICT, CAS
- Dr. Lei Wang, ICT, CAS
- Wanling Gao, ICT, CAS
- Chunjie Luo, ICT, CAS
- Qiang Yang, BAFST
- Jingwei Li, BAFST
- Xinhui Tian, ICT, CAS
- Dr. Gang Lu, BAFST
- Xinlong Lin, BAFST
- Rui Ren, ICT, CAS
- Dr. Rui Han, ICT, CAS
Alumni
- Dr. Zhen Jia
- Hainan Ye
- Dr. Yingjie Shi
- Zijian Ming
- Yuanqing Guo
- Dr. Yuqing Zhu
License
BigDataBench is available for researchers interested in big data. BigDataBench itself is open source under the Apache License, Version 2.0; please use all files in compliance with that license. The software components of BigDataBench are available as open-source software and are governed by their own licensing terms; researchers intending to use BigDataBench must fully understand and comply with the licenses of the various components.

Software developed externally (not by the BigDataBench group):
- Boost: http://www.boost.org/doc/libs/1_43_0/more/getting_started/unix-variants.html
- GCC: http://gcc.gnu.org/releases.html
- GSL: http://www.gnu.org/software/gsl/
- Graph500 : http://www.graph500.org/referencecode
- Hadoop: http://www.apache.org/licenses/LICENSE-2.0
- HBase : http://hbase.apache.org/
- Hive: http://hive.apache.org/
- Impala: https://github.com/cloudera/impala
- MySQL : http://www.mysql.com/
- Mahout: http://www.apache.org/licenses/LICENSE-2.0
- Mpich: http://www.mpich.org/
- Nutch : http://www.apache.org/licenses/LICENSE-2.0
- Parallel Boost Graph Library : http://www.osl.iu.edu/research/pbgl/software/
- Spark: http://spark.incubator.apache.org/
- Shark: http://shark.cs.berkeley.edu/
- Scala: http://www.scala-lang.org/download/2.9.3.html
- Zookeeper: http://zookeeper.apache.org/
Software developed internally (by the BigDataBench group):

BigDataBench_3.2 License

BigDataBench_3.2 Suite. Copyright (c) 2013-2015, ICT, Chinese Academy of Sciences. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistribution of source code must comply with the license and notice disclaimers
- Redistribution in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimers in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.