Overview
News: We are organizing a workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware in conjunction with VLDB 2014.
We will upgrade the big data analytics benchmarks and report performance numbers for Shark, Impala, and Hive soon.
BigDataBench is a big data benchmark suite derived from internet services (please refer to our summary paper and our presentation (.ppt) at HPCA 2014). It includes six real-world data sets and nineteen big data workloads, covering six application scenarios: micro benchmarks, Cloud "OLTP", relational query, search engine, social networks, and e-commerce. To generate representative and diverse big data workloads, BigDataBench features an abstracted set of Operations and Patterns for big data processing (please refer to our DASFAA paper). BigDataBench also provides a suite of big data generation tools, BDGS, which generates scalable big data (e.g., at PB scale) from small-scale real-world data while preserving the original characteristics of that data. Different implementations are provided for the same workloads. For example, we and other developers have implemented the offline analytics workloads using MapReduce, MPI, Spark, and DataMPI, among others. Several users have used BigDataBench for different purposes, e.g., workload characterization and evaluating hardware systems.
License
BigDataBench is available for researchers interested in big data. Software components of BigDataBench are all available as open-source software and are governed by their own licensing terms. Researchers intending to use BigDataBench are required to fully understand and abide by the licensing terms of the various components. BigDataBench itself is open-source under the Apache License, Version 2.0. Please use all files in compliance with the License.
Benchmarks
Covering six application scenarios, BigDataBench 2.0 includes six real-world data sets and nineteen big data workloads.
Table 1: The Summary of Six Data Sets
No. | Data set | Data size |
1 | Wikipedia Entries | 4,300,000 English articles |
2 | Amazon Movie Reviews | 7,911,684 reviews |
3 | Google Web Graph | 875,713 nodes, 5,105,039 edges |
4 | Facebook Social Network | 4,039 nodes, 88,234 edges |
5 | E-commerce Transaction Data | Table 1: 4 columns, 38,658 rows. Table 2: 6 columns, 242,735 rows |
6 | ProfSearch Person Resumes | 278,956 resumes |
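The node and edge counts in Table 1 can be turned into an edges-per-node ratio, one simple characteristic a reader might compare across the two graph data sets. The following is a minimal sketch: the counts are copied from Table 1, while the dictionary and the helper name `edges_per_node` are hypothetical and not part of BigDataBench.

```python
# Node and edge counts copied from Table 1.
GRAPH_DATA_SETS = {
    "Google Web Graph": (875713, 5105039),
    "Facebook Social Network": (4039, 88234),
}


def edges_per_node(nodes, edges):
    """Return the ratio of edges to nodes (hypothetical helper)."""
    return edges / nodes


for name, (nodes, edges) in GRAPH_DATA_SETS.items():
    print(f"{name}: {edges_per_node(nodes, edges):.2f} edges per node")
```

The ratio ignores whether a graph is directed or undirected, so it is only a rough density indicator.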
Table 2: The Summary of BigDataBench
Application Scenarios | Operations & Algorithms | Data Type | Data Source | Software Stack | Application Type |
Micro Benchmarks | Sort | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics |
Micro Benchmarks | Grep | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics |
Micro Benchmarks | WordCount | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics |
Micro Benchmarks | BFS | Unstructured | Graph | MapReduce, Spark, MPI | Offline Analytics |
Basic Datastore Operations ("Cloud OLTP") | Read | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services |
Basic Datastore Operations ("Cloud OLTP") | Write | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services |
Basic Datastore Operations ("Cloud OLTP") | Scan | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services |
Relational Query | Select Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics |
Relational Query | Aggregate Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics |
Relational Query | Join Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics |
Search Engine | Nutch Server | Structured | Table | Hadoop | Online Services |
Search Engine | PageRank | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics |
Search Engine | Index | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics |
Social Network | Olio Server | Structured | Table | MySQL | Online Services |
Social Network | K-means | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics |
Social Network | Connected Components | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics |
E-commerce | Rubis Server | Structured | Table | MySQL | Online Services |
E-commerce | Collaborative Filtering | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics |
E-commerce | Naive Bayes | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics |
Downloads
Downloading user manuals
BigDataBench 2.2 user manual [Doc]
Downloading raw data sets
No. | Data sets | Description |
1 | Wikipedia Entries | Size: 9.8GB |
2 | Amazon Movie Reviews | Size: 3.1GB |
3 | Google Web Graph | Size: 23MB |
4 | Facebook Social Network | Size: 220KB |
5 | E-commerce Transaction Data | ECT.tar.gz (Available soon) |
6 | ProfSearch Person Resumes | PPR.tar.gz (Available soon) |
Downloading software packages
We provide two options: download the full software package at once, or download the components one by one. Please note that you need to download and deploy the prerequisite software packages before using the BigDataBench workloads. Please refer to the user manual.
The following packages should be installed first. The running platform is Linux.
Software | Version | Download |
Hadoop | 1.0.2 | http://hadoop.apache.org/#Download+Hadoop |
HBase | 0.94.5 | http://www.apache.org/dyn/closer.cgi/hbase/ |
Cassandra | 1.2.3 | http://cassandra.apache.org/download/ |
MongoDB | 2.4.1 | http://www.mongodb.org/downloads |
Mahout | 0.8 | https://cwiki.apache.org/confluence/display/MAHOUT/Downloads |
Hive | 0.9.0 | https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration |
Spark | 0.8.0 | http://spark.incubator.apache.org/ |
Impala | 1.1.1 | http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_install.html |
MPICH | 2.0 | http://www.mpich.org/downloads/ |
Boost | 1_43_0 | http://www.boost.org/doc/libs/1_43_0/more/getting_started/unix-variants.html |
Scala | 2.9.3 | http://www.scala-lang.org/download/2.9.3.html |
GCC | 4.8.2 | http://gcc.gnu.org/releases.html |
GSL | 1.16 | http://www.gnu.org/software/gsl/ |
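When preparing a machine, it can help to compare what is installed against the versions listed above. The following is a minimal sketch under stated assumptions: the version strings are copied from the table, while the helper names `normalize` and `mismatches` and the example inventory are hypothetical. The table does not say whether other versions also work, so the sketch only reports differences rather than declaring a setup invalid.

```python
# Versions copied from the prerequisite table above.
LISTED_VERSIONS = {
    "Hadoop": "1.0.2",
    "HBase": "0.94.5",
    "Cassandra": "1.2.3",
    "MongoDB": "2.4.1",
    "Mahout": "0.8",
    "Hive": "0.9.0",
    "Spark": "0.8.0",
    "Impala": "1.1.1",
    "MPICH": "2.0",
    "Boost": "1_43_0",
    "Scala": "2.9.3",
    "GCC": "4.8.2",
    "GSL": "1.16",
}


def normalize(version):
    """Turn '1_43_0' or '1.43.0' into a tuple of ints such as (1, 43, 0)."""
    return tuple(int(part) for part in version.replace("_", ".").split("."))


def mismatches(installed):
    """Return {name: (listed, installed)} for packages that are missing or differ."""
    report = {}
    for name, listed in LISTED_VERSIONS.items():
        found = installed.get(name)
        if found is None or normalize(found) != normalize(listed):
            report[name] = (listed, found)
    return report


# Hypothetical inventory: Boost written with dots, a newer Spark, and no GSL.
example = dict(LISTED_VERSIONS, Boost="1.43.0", Spark="0.9.0")
del example["GSL"]
print(mismatches(example))
```

The inventory is passed in as a plain dictionary, so the sketch makes no assumption about how each package reports its own version.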
Full downloading
Full software packages of the different implementations are available from the following links:
Hadoop+Hive version: BigDataBench_V2.2.tar.gz
MPI version: BigDataBench_MPI_V2.2.tar
Spark version: BigDataBench_Sprak_V2.2.tar
DataMPI version: BigDataBench_DataMPI (Available soon)
Separate downloading
You may download the individual components of BigDataBench from the following tables.
Table 3: BDGS: Big Data Generator Suite in BigDataBench
BDGS generates big data on the basis of six raw data sets.
Name | Description |
Text | Size: 9.82MB |
Graph | (same package as above) |
Table | (same package as above) |
Table 4: BigDataBench workloads. Please note that each shell script for generating data and running workloads is included in the distribution.
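Workloads | Name | Description |
Micro Benchmarks | Sort | MicroBenchmarks Size: 10KB |
Micro Benchmarks | Grep | (same package as above) |
Micro Benchmarks | WordCount | (same package as above) |
Micro Benchmarks | BFS | MPI Version: BFS_MPI.tar.gz Size: 4.7MB |
Basic Datastore Operations ("Cloud OLTP") | Read | BasicDatastoreOperations.tar.gz Size: 93.7MB |
Basic Datastore Operations ("Cloud OLTP") | Write | (same package as above) |
Basic Datastore Operations ("Cloud OLTP") | Scan | (same package as above) |
Relational Query | Select Query | Hive Version: RelationalQuery.tar.gz Size: 1.9KB |
Relational Query | Aggregate Query | (same package as above) |
Relational Query | Join Query | (same package as above) |
Search Engine | Nutch Server | Nutch Version: Nutch_Server.tar.gz; Size: 178MB; User Manual: [PDF]; Index and Segment data: [Data]; Size: 4.98GB |
Search Engine | Indexing | SearchEngine.tar.gz Size: 177.9MB |
Search Engine | PageRank | (same package as above) |
SNS | Olio Server | Size: 237MB |
SNS | Kmeans | SNS.tar.gz Size: 5.81MB |
SNS | Connected component | (same package as above) |
E-commerce | Rubis Server | Rubis.tar.gz (Available soon) |
E-commerce | Collaborative Filtering | E-commerce.tar.gz Size: 4.32MB |
E-commerce | Naive Bayes | (same package as above) |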
Publications
For Citations
If you need a citation for BigDataBench, please cite the following papers that are related to your work:
BigDataBench: a Big Data Benchmark Suite from Internet Services. [PDF]
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Cheng Zhen, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. The 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, USA.
BigOP: generating comprehensive big data workloads as a benchmarking framework. [pdf]
Yuqing Zhu, Jianfeng Zhan, Chuliang Weng, Raghunath Nambiar, Jingchao Zhang, Xingzhen Chen, and Lei Wang. The 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014), 2014.
BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. [PDF]
Zijian Ming, Chunjie Luo, Wanling Gao, Rui Han, Qiang Yang, Lei Wang, and Jianfeng Zhan. Lecture Notes in Computer Science, extended version for the Fourth Workshop on Big Data Benchmarking, 2014.
BigDataBench: a Big Data Benchmark Suite from Web Search Engines
Wanling Gao, Yuqing Zhu, Zhen Jia, Chunjie Luo, Lei Wang, Jianfeng Zhan, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, and Bizhu Qiu. Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with the 40th International Symposium on Computer Architecture, May 2013.
Presentations:
BigDataBench: A big data benchmark suite
Jianfeng Zhan, Professor, invited talk at Third Workshop on Big Data Benchmarking. 2013.
Research Highlights of BPOE
Jianfeng Zhan, Professor, invited talk at the Fourth Workshop on Big Data Benchmarking, 2013. [PPT]
BigDataBench: Benchmarking Big Data Systems
Yingjie Shi, Assistant Professor, invited talk at the First Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, in conjunction with IEEE Big Data 2013. [PPT]
Benchmarking Datacenter and Big Data Systems
Jianfeng Zhan, professor, invited talk at Third Workshop on Big Data Benchmarking 2013 [PPT]
BigDataBench: a Big Data Benchmark Suite from Web Search Engines
Jianfeng Zhan, professor, Third Workshop on Architectures and Systems for Big Data(ASBD 2013) in conjunction with The 40th ISCA 2013. [PPT]
BigDataBench: a Benchmark Suite for Big Data Application
Wanling Gao, Ph.D candidate, the 19th IEEE International Symposium on High Performance Computer Architecture (HPCA 2013) Tutorial [PPT]
The Implications of Diverse and Scalable Data Sets in Benchmarking Big Data Systems
Zhen Jia, Ph.D candidate, Second Workshop on Big Data Benchmarking 2012 [PPT]
News
2014-3-14, MPI and Spark implementations are released.
2014-2-27, Prof. Jianfeng Zhan gave an open talk at Ohio State University.
2014-2-18, Prof. Jianfeng Zhan gave a talk about BigDataBench at HPCA 2014.
2014-1-15, Another of our papers about BigDataBench was accepted by a data management conference, DASFAA 2014.
2014-1-4, Our paper about BigDataBench 2.0 was accepted by HPCA 2014. [PDF]
2013-11-22, Prof. Jianfeng Zhan gave a talk at IBM Austin Research Laboratory.
2013-10-9, Prof. Jianfeng Zhan gave an invited talk at the Fourth Workshop on Big Data Benchmarking.
2013-10-8, BigDataBench 2.0 released.
2013-10-8, Assistant Professor Yingjie Shi gave an invited talk at the First Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware.
2013-6-25, Prof. Jianfeng Zhan gave a presentation at ASBD 2013, in conjunction with the 40th ISCA. [PPT]
2013-6-25, BigDataBench 1.0 released.
2013-6-16, Prof. Jianfeng Zhan gave an invited talk at the Third Workshop on Big Data Benchmarking. [PPT]
2013-2-24, Ph.D candidate Wanling Gao gave a presentation at the HPCA 2013 Tutorial [PPT]
Users
(Please write to zhanjianfeng@ict.ac.cn if you have suggestions for other papers or would like to have your publications included here.)
Comments on BigDataBench
(1) Guojie Li, Big Data Challenges to Computer Systems (in Chinese), Communication of CCF, Vol. 9, December, 2012.
(2) Nicole Hemsoth (HPCWire Editor), Toward Comprehensive Big Data Benchmarking. 2014.1.3
Research Projects Using BigDataBench
(1) DataMPI, Prof. Zhiwei Xu, Fan Liang(ICT, Chinese Academy of Sciences), Dr. Xiaoyi Lu (Ohio State University)
Research Papers Using BigDataBench
1. Workload Characterization
Wei Wei, Dejun Jiang, Jin Xiong, Mingyu Chen. Exploring Opportunities for Non-Volatile Memories in Big Data Applications. BPOE-4, in conjunction with ASPLOS 2014.
Fengfeng Pan, Yinliang Yue, and Jin Xiong. I/O Characterization of Big Data Workloads in Data Centers. BPOE-4, in conjunction with ASPLOS 2014.
Zhen Jia, Lei Wang, Jianfeng Zhan, Lixin Zhang, Chunjie Luo. Characterizing data analysis workloads in data centers. [PDF] [Slides]. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013)(Best paper award)
2. Evaluating and Optimizing Big Data Hardware Systems
3. Performance diagnosis and Optimization of Big Data Systems
4. Evaluating and Optimizing Big Data Systems Energy Efficiency
5. Evaluation of Virtualization Systems
6. Evaluating Programming Systems
Other Papers Citing BigDataBench
1. Data Center Resource Management
2. Hadoop Systems Evaluation and Optimization
People
- Prof. Jianfeng Zhan
- Lei Wang
- Chunjie Luo
- Yuqing Zhu
- Qiang Yang
- Wanling Gao
- Zijian Ming
Contact Us
Email:
zhanjianfeng@ict.ac.cn
ICTBench
http://prof.ict.ac.cn/ICTBench