Overview

News: We are organizing a workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware in conjunction with VLDB 2014.

We will upgrade the big data analytics benchmarks and report performance numbers for Shark, Impala, and Hive soon.

BigDataBench is a big data benchmark suite drawn from internet services (please refer to our summary paper and our presentation (.ppt) at HPCA 2014). It includes six real-world data sets and nineteen big data workloads, covering six application scenarios: micro benchmarks, Cloud "OLTP", relational query, search engine, social networks, and e-commerce. To generate representative and varied big data workloads, BigDataBench features an abstracted set of Operations and Patterns for big data processing (please refer to our DASFAA paper). BigDataBench also provides several big data generation tools, collectively called BDGS, which generate scalable big data (e.g., at PB scale) from small-scale real-world data while preserving the characteristics of the original data. Different implementations are provided for the same workloads; for example, we and other developers have implemented the offline analytics workloads using MapReduce, MPI, Spark, and DataMPI. Several users have used BigDataBench for different purposes, e.g., workload characterization and evaluating hardware systems.
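To make the idea of scaling up a small real-world seed while preserving its characteristics more concrete, here is a minimal Python sketch. It is an illustration only and not the BDGS models, which this page does not describe: it assumes the only characteristic to preserve is the word-frequency distribution of a text seed, and the names `fit_unigram` and `generate` are hypothetical.

```python
import random
from collections import Counter


def fit_unigram(seed_text):
    """Estimate word frequencies from a small seed text."""
    counts = Counter(seed_text.split())
    words = sorted(counts)                  # fixed order keeps runs reproducible
    weights = [counts[w] for w in words]
    return words, weights


def generate(seed_text, n_words, rng_seed=0):
    """Sample n_words words whose frequencies follow the seed text.

    Assumption: word frequency is the only property preserved here;
    a real generator would model much more (topics, structure, etc.).
    """
    words, weights = fit_unigram(seed_text)
    rng = random.Random(rng_seed)
    return rng.choices(words, weights=weights, k=n_words)


if __name__ == "__main__":
    seed = "big data benchmark big data workloads big"
    print(" ".join(generate(seed, 20)))
```

The output can be made arbitrarily large by raising `n_words`, while the relative word frequencies stay close to those of the seed.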

License

BigDataBench is available for researchers interested in big data. Software components of BigDataBench are all available as open-source software and governed by their own licensing terms. Researchers intending to use BigDataBench are required to fully understand and abide by the licensing terms of the various components. BigDataBench is open-source under the Apache License, Version 2.0. Please use all files in compliance with the License.

Benchmarks

Covering six application scenarios, BigDataBench 2.0 includes six real-world data sets and nineteen big data workloads.

Table 1: The Summary of Six Data Sets

No. | Data set | Data size
1 | Wikipedia Entries | 4,300,000 English articles
2 | Amazon Movie Reviews | 7,911,684 reviews
3 | Google Web Graph | 875713 nodes, 5105039 edges
4 | Facebook Social Network | 4039 nodes, 88234 edges
5 | E-commerce Transaction Data | table1: 4 columns, 38658 rows; table2: 6 columns, 242735 rows
6 | ProfSearch Person Resumes | 278956 resumes


Table 2: The Summary of BigDataBench

Application Scenario | Operations & Algorithm | Data Type | Data Source | Software Stack | Application Type
Micro Benchmarks | Sort | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics
Micro Benchmarks | Grep | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics
Micro Benchmarks | WordCount | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics
Micro Benchmarks | BFS | Unstructured | Graph | MapReduce, Spark, MPI | Offline Analytics
Basic Datastore Operations ("Cloud OLTP") | Read | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services
Basic Datastore Operations ("Cloud OLTP") | Write | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services
Basic Datastore Operations ("Cloud OLTP") | Scan | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services
Relational Query | Select Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics
Relational Query | Aggregate Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics
Relational Query | Join Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics
Search Engine | Nutch Server | Structured | Table | Hadoop | Online Services
Search Engine | PageRank | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics
Search Engine | Index | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics
Social Network | Olio Server | Structured | Table | MySQL | Online Services
Social Network | K-means | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics
Social Network | Connected Components | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics
E-commerce | Rubis Server | Structured | Table | MySQL | Online Services
E-commerce | Collaborative Filtering | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics
E-commerce | Naive Bayes | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics
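As a rough illustration of what an offline analytics micro benchmark such as WordCount computes, the following single-process Python sketch mimics the map and reduce phases. It is not the BigDataBench implementation (those run on MapReduce, Spark, or MPI), and the function names and the whitespace-based tokenization are assumptions made for the example.

```python
from collections import Counter
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in line.lower().split()]


def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(lines):
    """Run map over every line, then reduce all emitted pairs."""
    return reduce_phase(chain.from_iterable(map_phase(line) for line in lines))


if __name__ == "__main__":
    print(word_count(["big data benchmark", "big data workloads"]))
```

In a distributed implementation the map step runs in parallel over input splits and the pairs are shuffled by key before the reduce step; the sketch collapses that into one process.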

Downloads

Downloading user manuals

BigDataBench 2.2 user manual [Doc]

Downloading raw data sets

No. | Data sets | Description
1 | Wikipedia Entries | Wiki.bz2 (Size: 9.8GB)
2 | Amazon Movie Reviews | AMR.tar.gz (Size: 3.1GB)
3 | Google Web Graph | GWG.bz2 (Size: 23MB)
4 | Facebook Social Network | FSN.bz2 (Size: 220KB)
5 | E-commerce Transaction Data | ECT.tar.gz (Available soon)
6 | ProfSearch Person Resumes | PPR.tar.gz (Available soon)
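Some of the raw data sets are large .bz2 files, so it can help to stream them rather than decompress them to disk first. The sketch below uses Python's standard `bz2` module to count lines in such a file. It assumes the archive contains line-oriented text, which this page does not state, and `count_lines` is a hypothetical helper name.

```python
import bz2


def count_lines(path, encoding="utf-8"):
    """Stream a .bz2 file and count its lines without loading it into memory.

    Assumption: the compressed payload is line-oriented text.
    """
    total = 0
    # errors="replace" keeps the scan going if some bytes are not valid text
    with bz2.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for _ in handle:
            total += 1
    return total
```

For example, `count_lines("GWG.bz2")` would scan the Google Web Graph file once it has been downloaded to the current directory.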

Downloading software packages

We provide two options: download the full software package at once, or download the components one by one. Please note that you need to download and deploy the prerequisite software packages before using the BigDataBench workloads. Please refer to the user manual.

The following packages should be installed first. The running platform is Linux.

Software | Version | Download
Hadoop | 1.0.2 | http://hadoop.apache.org/#Download+Hadoop
HBase | 0.94.5 | http://www.apache.org/dyn/closer.cgi/hbase/
Cassandra | 1.2.3 | http://cassandra.apache.org/download/
MongoDB | 2.4.1 | http://www.mongodb.org/downloads
Mahout | 0.8 | https://cwiki.apache.org/confluence/display/MAHOUT/Downloads
Hive | 0.9.0 | https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration
Spark | 0.8.0 | http://spark.incubator.apache.org/
Impala | 1.1.1 | http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_install.html
MPICH | 2.0 | http://www.mpich.org/downloads/
Boost | 1_43_0 | http://www.boost.org/doc/libs/1_43_0/more/getting_started/unix-variants.html
Scala | 2.9.3 | http://www.scala-lang.org/download/2.9.3.html
GCC | 4.8.2 | http://gcc.gnu.org/releases.html
GSL | 1.16 | http://www.gnu.org/software/gsl/
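The version list above can be turned into a simple consistency check. The Python sketch below compares a dictionary of installed versions against the listed ones. It assumes an exact version match is wanted, which this page does not state, and it leaves out how the installed versions are discovered; `REQUIRED`, `parse_version`, and `check` are hypothetical names.

```python
# Versions copied from the prerequisite list on this page.
REQUIRED = {
    "Hadoop": "1.0.2",
    "HBase": "0.94.5",
    "Cassandra": "1.2.3",
    "MongoDB": "2.4.1",
    "Mahout": "0.8",
    "Hive": "0.9.0",
    "Spark": "0.8.0",
    "Impala": "1.1.1",
    "MPICH": "2.0",
    "Boost": "1_43_0",
    "Scala": "2.9.3",
    "GCC": "4.8.2",
    "GSL": "1.16",
}


def parse_version(text):
    """Turn '1.0.2' or '1_43_0' into a tuple of integers."""
    return tuple(int(part) for part in text.replace("_", ".").split("."))


def check(installed):
    """Return {name: (required, found)} for each missing or mismatched package.

    Assumption: an exact match with the listed version is required.
    """
    problems = {}
    for name, required in REQUIRED.items():
        found = installed.get(name)
        if found is None or parse_version(found) != parse_version(required):
            problems[name] = (required, found)
    return problems
```

An empty result from `check` means every listed package is present at the listed version; a relaxed comparison (for example, minimum versions) would be a small change to the condition.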

Full downloading

Full software packages of the different implementations are available from the following links:

Hadoop+Hive version: BigDataBench_V2.2.tar.gz

MPI version: BigDataBench_MPI_V2.2.tar

Spark version: BigDataBench_Sprak_V2.2.tar

DataMPI version: BigDataBench_DataMPI (Available soon)

Separate downloading

You may download the individual components of BigDataBench from the following tables.

Table 3: BDGS: Big Data Generator Suite in BigDataBench

BDGS generates big data on the basis of six raw data sets.

Name | Description
Text, Graph, Table | BigDataGeneratorSuite.tar.gz (Size: 9.82MB)

Table 4: BigDataBench workloads. Please note that the shell scripts for generating data and running the workloads are included in the distribution.

Workloads | Name | Description
Micro Benchmarks | Sort, Grep, WordCount | MicroBenchmarks (Hadoop, MPI, Spark); Size: 10KB
Micro Benchmarks | BFS | MPI Version: BFS_MPI.tar.gz; Size: 4.7MB
Basic Datastore Operations ("Cloud OLTP") | Read, Write, Scan | BasicDatastoreOperations.tar.gz; Size: 93.7MB
Relational Query | Select Query, Aggregate Query, Join Query | Hive Version: RelationalQuery.tar.gz; Size: 1.9KB
Search Engine | Nutch Server | Nutch Version: Nutch_Server.tar.gz; Size: 178MB. User Manual: [PDF]. Index and Segment data: [Data]; Size: 4.98GB
Search Engine | Indexing, PageRank | SearchEngine.tar.gz (Hadoop, MPI, Spark); Size: 177.9MB
SNS | Olio Server | Olio.tar.gz; Size: 237MB
SNS | K-means, Connected Components | SNS.tar.gz (Hadoop, MPI, Spark); Size: 5.81MB
E-commerce | Rubis Server | Rubis.tar.gz (Available soon)
E-commerce | Collaborative Filtering, Naive Bayes | E-commerce.tar.gz (Hadoop, MPI, Spark); Size: 4.32MB
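Since the shell scripts for data generation and workload runs ship inside the archives, one quick way to locate them after downloading is to list the archive members. The Python sketch below uses the standard `tarfile` module. It assumes the scripts use a `.sh` suffix, which this page does not confirm, and `list_shell_scripts` is a hypothetical helper name.

```python
import tarfile


def list_shell_scripts(archive_path):
    """Return the sorted names of *.sh files in a tar archive.

    mode="r:*" lets tarfile detect plain, gzip, or bzip2 archives.
    Assumption: the bundled scripts end in ".sh".
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".sh")
        )
```

Calling `list_shell_scripts("SNS.tar.gz")` on a downloaded archive would show which scripts it contains without extracting anything.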

Publications

For Citations

If you need a citation for BigDataBench, please cite whichever of the following papers relate to your work:

BigDataBench: a Big Data Benchmark Suite from Internet Services. [PDF]

Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Cheng Zhen, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. The 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, USA.

BigOP: generating comprehensive big data workloads as a benchmarking framework. [pdf]

Yuqing Zhu, Jianfeng Zhan, Chuliang Weng, Raghunath Nambiar, Jingchao Zhang, Xingzhen Chen, and Lei Wang. The 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014), 2014.

BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. [PDF]

Zijian Ming, Chunjie Luo, Wanling Gao, Rui Han, Qiang Yang, Lei Wang, and Jianfeng Zhan. Lecture Notes in Computer Science, extended version for the Fourth Workshop on Big Data Benchmarking, 2014.

BigDataBench: a Big Data Benchmark Suite from Web Search Engines

Wanling Gao, Yuqing Zhu, Zhen Jia, Chunjie Luo, Lei Wang, Jianfeng Zhan, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, and Bizhu Qiu. Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with the 40th International Symposium on Computer Architecture, May 2013.

Presentations:

BigDataBench: A big data benchmark suite

Jianfeng Zhan, Professor, invited talk at Third Workshop on Big Data Benchmarking. 2013.

Research Highlights of BPOE

Jianfeng Zhan, Professor, invited talk at Fourth Workshop on Big Data Benchmarking 2013. [PPT]

BigDataBench: Benchmarking Big Data Systems

Yingjie Shi, Assistant professor, invited talk at First Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, in conjunction with IEEE Big Data 2013. [PPT]

Benchmarking Datacenter and Big Data Systems

Jianfeng Zhan, professor, invited talk at Third Workshop on Big Data Benchmarking 2013 [PPT]

BigDataBench: a Big Data Benchmark Suite from Web Search Engines

Jianfeng Zhan, professor, Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with the 40th ISCA 2013. [PPT]

BigDataBench: a Benchmark Suite for Big Data Application

Wanling Gao, Ph.D candidate, the 19th IEEE International Symposium on High Performance Computer Architecture (HPCA 2013) Tutorial [PPT]

The Implications of Diverse and Scalable Data Sets in Benchmarking Big Data Systems

Zhen Jia, Ph.D candidate, Second Workshop on Big Data Benchmarking 2012 [PPT]

News

2014-3-14, MPI and Spark implementations are released.

2014-2-27, Prof. Jianfeng Zhan gave an open talk at Ohio State University.

2014-2-18, Prof. Jianfeng Zhan gave a talk about BigDataBench at HPCA 2014.

2014-1-15, Another of our papers about BigDataBench was accepted by a data management conference, DASFAA 2014.

2014-1-4, Our paper about BigDataBench 2.0 was accepted by HPCA 2014 [PDF]

2013-11-22, Prof. Jianfeng Zhan gave a talk at IBM Austin Research Laboratory.

2013-10-9, Professor Jianfeng Zhan gave an invited talk at the Fourth Workshop on Big Data Benchmarking.

2013-10-8, BigDataBench 2.0 released.

2013-10-8, Assistant professor Yingjie Shi gave an invited talk at the First Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware.

2013-6-25, Professor Jianfeng Zhan gave a presentation at ASBD 2013, in conjunction with the 40th ISCA. [PPT]

2013-6-25, BigDataBench 1.0 released.

2013-6-16, Professor Jianfeng Zhan gave an invited talk at the Third Workshop on Big Data Benchmarking [PPT]

2013-2-24, Ph.D candidate Wanling Gao gave a presentation at the HPCA 2013 Tutorial [PPT]

 

Users

(Please write to zhanjianfeng@ict.ac.cn if you have suggestions for other papers or would like to have your publications included here.)

Comments on BigDataBench

(1) Guojie Li, Big Data Challenges to Computer Systems (in Chinese), Communication of CCF, Vol. 9, December, 2012.

(2) Nicole Hemsoth (HPCWire Editor), Toward Comprehensive Big Data Benchmarking. 2014.1.3

 

Research Projects Using BigDataBench

(1) DataMPI, Prof. Zhiwei Xu, Fan Liang (ICT, Chinese Academy of Sciences), Dr. Xiaoyi Lu (Ohio State University)

Research Papers Using BigDataBench

1. Workload Characterization

Wei Wei, Dejun Jiang, Jin Xiong, Mingyu Chen. Exploring Opportunities for Non-Volatile Memories in Big Data Applications. BPOE-4, in conjunction with ASPLOS 2014.

Fengfeng Pan, Yinliang Yue, and Jin Xiong. I/O Characterization of Big Data Workloads in Data Centers. BPOE-4, in conjunction with ASPLOS 2014.

Zhen Jia, Lei Wang, Jianfeng Zhan, Lixin Zhang, Chunjie Luo. Characterizing data analysis workloads in data centers. [PDF] [Slides]. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013)(Best paper award)

Xiong, W., Yu, Z., Bei, Z., Zhao, J., Zhang, F., Zou, Y., … & Xu, C. (2013, October). A characterization of big data benchmarks. In Big Data, 2013 IEEE International Conference on (pp. 118-125). IEEE.

2. Evaluating and Optimizing Big Data Hardware Systems

Quan, J., Shi, Y., Zhao, M., & Yang, W. (2013, October). The implications from benchmarking three big data systems. [PDF]. In Big Data, 2013 IEEE International Conference on (pp. 31-38). IEEE.

3. Performance diagnosis and Optimization of Big Data Systems

Chen, P., Qi, Y., Li, X., & Su, L. (2013, October). An ensemble MIC-based approach for performance diagnosis in big data platform. [PDF]. In Big Data, 2013 IEEE International Conference on (pp. 78-85). IEEE.

4. Evaluating and Optimizing Big Data Systems Energy Efficiency

Zhou, R., Shi, Y., & Zhu, C. (2013, October). AxPUE: Application level metrics for power usage effectiveness in data centers. [PDF]. In Big Data, 2013 IEEE International Conference on (pp. 110-117). IEEE.

5. Evaluation of Virtualization Systems

Ning, F., Weng, C., & Luo, Y. (2013, October). Virtualization I/O optimization based on shared memory. [PDF]. In Big Data, 2013 IEEE International Conference on (pp. 70-77). IEEE.

6. Evaluating Programming Systems

Liang Fan, Feng Chen, Lu Xiaoyi and Xu Zhiwei. Performance Benefits of DataMPI: A Case Study with BigDataBench. BPOE-4, in conjunction with ASPLOS 2014.

Other Papers Citing BigDataBench

1. Data Center Resource Management

Jianfeng Zhan; Lei Wang; Xiaona Li; Weisong Shi; Chuliang Weng; Wenyao Zhang; Xiutao Zang, “Cost-Aware Cooperative Resource Provisioning for Heterogeneous Workloads in Data Centers,” IEEE Transactions on Computers, vol. 62, no. 11, pp. 2155-2168, Nov. 2013

2. Hadoop Systems Evaluation and Optimization

Liu, S., Xu, J., Liu, Z., & Liu, X. (2013, October). Evaluating task scheduling in hadoop-based cloud systems. In Big Data, 2013 IEEE International Conference on (pp. 47-53). IEEE.

 

People

Contact Us

Email:

zhanjianfeng@ict.ac.cn

wl@ncic.ac.cn

ICTBench

http://prof.ict.ac.cn/ICTBench

Workshop on Big Data Benchmarks

http://prof.ict.ac.cn/bpoe