BigDataBench is a big data benchmark suite derived from web search engines. The first release (BigDataBench 1.0) provides six representative big data applications from search engines, the most important domain among Internet services in terms of page views and daily visitors. It also provides an innovative data generation tool that generates scalable volumes of big data from small-scale real data while preserving the semantics and locality of the real data. The data sets in BigDataBench are generated by this tool. Users can also combine other applications in BigDataBench according to their own requirements.
The benchmark suite, with all its applications and input sets, is available as open source, free of charge. It is intended for researchers interested in the field of big data applications.
In the era of information explosion, more and more data are produced, and people are producing and sharing data continuously. The pressure to evaluate and compare the performance, energy efficiency, and cost effectiveness of big data systems is rising. However, few benchmark suites for big data applications exist. In this regard, we propose a new big data benchmark suite: BigDataBench.
BigDataBench differs from other big data benchmark suites in the following ways:
Incremental Approach: First, we investigate application domains and single out the most important one, search engines, based on daily visitors and page views. Second, we choose typical workloads from search engines as candidates for BigDataBench.
Innovative Data Generation Methodology and Tool: We provide a data generation tool to overcome the difficulty of obtaining real big data. This tool generates big data from small-scale data while preserving key characteristics of the real data.
Variety of Workloads: BigDataBench consists of six representative workloads, including both analysis workloads and service workloads. They have different characteristics in terms of computation, memory, and I/O access patterns.
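The idea of generating a large data set from a small seed, mentioned in the list above, can be illustrated with a toy sketch. The algorithm actually used by the BigDataBench tool is not described here, so the sketch below swaps in a simple word-level bigram (Markov chain) model as an assumption. The function names `build_bigram_model` and `generate` are hypothetical and are not part of BigDataBench.

```python
import random
from collections import Counter, defaultdict


def build_bigram_model(seed_text):
    """Count, for every word in the seed, which words follow it."""
    words = seed_text.split()
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return words, model


def generate(seed_text, target_words, rng=None):
    """Emit target_words words by walking the seed's bigram statistics.

    Illustrative only: this is an assumed stand-in technique, not the
    method implemented by the BigDataBench data generation tool.
    """
    rng = rng or random.Random(0)
    words, model = build_bigram_model(seed_text)
    if not words or target_words <= 0:
        return []
    current = rng.choice(words)
    output = [current]
    while len(output) < target_words:
        followers = model.get(current)
        if followers:
            # Pick the next word in proportion to how often it follows
            # the current word in the seed.
            choices, weights = zip(*followers.items())
            current = rng.choices(choices, weights=weights, k=1)[0]
        else:
            # Dead end (a word with no follower in the seed): restart
            # from a random seed word.
            current = rng.choice(words)
        output.append(current)
    return output
```

Calling `generate(seed_text, n)` with a larger `n` yields a proportionally larger output that reuses only the seed's vocabulary and adjacent-word statistics, which is one simple way to read "preserving key characteristics" of the seed data.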
The current version of the suite contains the following 6 workloads from web search engines and a data generation tool:
BigDataBench is made available as open source. Each software component is governed by its own licensing terms. Users intending to use BigDataBench are required to fully understand and abide by the licensing terms of the various components.
If you want all content in BigDataBench and large file sizes do not bother you, you can use the following link to download the whole distribution.
As an alternative to the full benchmark suite, we also provide the distribution broken up into several smaller archives which contain only parts of BigDataBench.
Data generation tool
BigDataBench 1.0 Data Generation Tool
Source code of data generation tool
Size: 9.82M
Data analysis applications and sample seed data
BigDataBench 1.0 Sample Seed Data (for five data analysis applications)
Test inputs from Wikipedia (20 Newsgroups data set)
Size: 9.8M
BigDataBench 1.0 Sort_Grep_Wordcount_NaiveBayes
Source code of four workloads
Size: 20K
BigDataBench 1.0 SVM
Source code of SVM workload and training model of sample seed data
Size: 79.6M
Service workloads
BigDataBench 1.0 Nutch Server
Source code of Nutch workload and indexes
Size: 202M
For information on how to run BigDataBench, users can use the following link to download the user's manual.
BigDataBench: a Big Data Benchmark Suite from Web Search Engines [PDF][BIBTEX]
Wanling Gao, Yuqing Zhu, Zhen Jia, Chunjie Luo, Lei Wang, Jianfeng Zhan, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, and Bizhu Qiu. The Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with the 40th International Symposium on Computer Architecture, May 2013.
The Implications of Diverse and Scalable Data Sets in Benchmarking Big Data Systems [PDF][BIBTEX]
Zhen Jia, Lei Wang, Wanling Gao, Jianfeng Zhan, and Lixin Zhang. Second Workshop on Big Data Benchmarking. 2012.
A presentation at the Second Workshop on Big Data Benchmarking [PPT]
A tutorial at HPCA 2013 [PPT]
A presentation at ISCA-40's ASBD workshop [PPT]
An invited talk at the Third Workshop on Big Data Benchmarking [PPT]
Email:
wl@ncic.ac.cn
gaowanling@ict.ac.cn