What is BigDataBench

Warning! BigDataBench 3.0 is available here.

BigDataBench is a big data benchmark suite derived from web search engines. The first release (BigDataBench 1.0) provides six representative big data applications from search engines, which are the most important domain in Internet services in terms of page views and daily visitors. It also provides a data generation tool that produces scalable volumes of big data from small-scale real data while preserving the semantics and locality of the real data. The data sets in BigDataBench are generated by this tool. Users can also combine the applications in BigDataBench according to their own requirements.

Who can use BigDataBench

The benchmark suite, with all its applications and input sets, is available as open source, free of charge. It is intended for researchers working on big data applications.


In the era of information explosion, people produce and share data continuously, and data volumes keep growing. This increases the pressure to evaluate and compare the performance, energy efficiency, and cost effectiveness of big data systems. However, few benchmark suites for big data applications exist. To address this gap, we propose a new big data benchmark suite, BigDataBench.

Key Features

BigDataBench differs from other big data benchmark suites in the following ways:

Incremental Approach: First, we investigate application domains and single out the most important one, search engines, based on daily visitors and page views. Second, we choose typical workloads from search engines as candidates for BigDataBench.

Innovative Data Generation Methodology and Tool: We provide a data generation tool to overcome the difficulty of obtaining real big data. The tool generates big data from small-scale data while preserving key characteristics of the real data.

Variety of Workloads: BigDataBench consists of six representative workloads, covering both analysis workloads and service workloads. They differ in their computation, memory, and I/O access patterns.
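The idea of scaling a small seed data set up to a larger one can be illustrated with a toy sketch. This is not the algorithm used by the BigDataBench data generation tool, which is not described on this page; it is a deliberately simple stand-in that samples words according to their frequency in the seed text. The function name `generate_text` and its parameters are hypothetical.

```python
import random
from collections import Counter


def generate_text(seed_text, num_words, rng_seed=0):
    """Return num_words words sampled from the seed's word frequencies.

    Toy illustration only: it preserves the unigram distribution of the
    seed, not the semantics or locality that the real tool aims for.
    """
    counts = Counter(seed_text.split())
    words = sorted(counts)                 # fixed order for reproducibility
    weights = [counts[w] for w in words]   # more frequent words are drawn more often
    rng = random.Random(rng_seed)
    return rng.choices(words, weights=weights, k=num_words)


seed = "big data benchmark big data suite"
generated = generate_text(seed, 1000)
print(len(generated), "words generated from", len(seed.split()), "seed words")
```

The output size is controlled by `num_words`, so the same seed can be expanded to any volume; a realistic generator would need a richer model of the seed data than word frequencies.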

Benchmark Programs

The current version of the suite contains the following six workloads from web search engines, together with a data generation tool:

  • Sort: sorts the data in the input directory and writes the result to the output directory;
  • Wordcount: reads text files and counts how often each word occurs;
  • Grep: extracts matching strings from text files and counts how many times they occur;
  • Naïve Bayes: a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions;
  • Support Vector Machine: a supervised learning model with associated learning algorithms that analyze data and recognize patterns;
  • Nutch Server: an open source web search software project.
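The first three workloads can be sketched in a few lines to make their semantics concrete. The following is a minimal single-machine illustration on in-memory lines, not the suite's implementation; the function names `sort_records`, `word_count`, and `grep_count` are hypothetical.

```python
import re
from collections import Counter


def sort_records(lines):
    """Sort: return the input records in lexicographic order."""
    return sorted(lines)


def word_count(lines):
    """Wordcount: count how often each whitespace-separated word occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def grep_count(lines, pattern):
    """Grep: extract strings matching the regex and count each distinct match."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        counts.update(regex.findall(line))
    return counts


lines = ["big data benchmark", "a big data suite", "search engine data"]
print(sort_records(lines))
print(word_count(lines).most_common(2))
print(grep_count(lines, r"b\w+"))
```

In a big data setting the same operations run over files spread across many machines, which is what makes their computation, memory, and I/O behavior worth benchmarking.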

BigDataBench is made available as open source. Each software component is governed by its own licensing terms. Users intending to use BigDataBench are required to fully understand and abide by the licensing terms of the various components.

If you want the complete contents of BigDataBench and large files are not a concern, use the following link to download the whole distribution.

BigDataBench 1.0

As an alternative to the full benchmark suite, we also provide the distribution split into several smaller archives, each containing only part of BigDataBench.

  1. Data generation tool

    BigDataBench 1.0 Data Generation Tool

    Source code of the data generation tool

    Size: 9.82M


  2. Data analysis applications and sample seed data

    BigDataBench 1.0 Sample Seed Data (for five data analysis applications)

    Test inputs from Wikipedia (20 Newsgroups data set)



    BigDataBench 1.0 Sort_Grep_Wordcount_NaiveBayes

    Source code of the four workloads



    BigDataBench 1.0 SVM

    Source code of the SVM workload and the model trained on the sample seed data



  3. Service workloads

    BigDataBench 1.0 Nutch Server

    Source code of the Nutch workload and indexes



For information on how to run BigDataBench, download the user's manual from the following link.

User's Manual of BigDataBench_V1.0.pdf

For Citations

Please cite the following papers if you need a citation for BigDataBench:

BigDataBench: a Big Data Benchmark Suite from Web Search Engines [PDF][BIBTEX]

Wanling Gao, Yuqing Zhu, Zhen Jia, Chunjie Luo, Lei Wang, Jianfeng Zhan, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, and Bizhu Qiu. The Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with the 40th International Symposium on Computer Architecture, May 2013.

The Implications of Diverse and Scalable Data Sets in Benchmarking Big Data Systems [PDF][BIBTEX]

Zhen Jia, Lei Wang, Wanling Gao, Jianfeng Zhan, and Lixin Zhang. Second Workshop on Big Data Benchmarking. 2012.

Research papers using BigDataBench

We have given the following presentations about BigDataBench:

A presentation at the Second Workshop on Big Data Benchmarking [PPT]

A tutorial at HPCA 2013 [PPT]

A presentation at ISCA-40's ASBD workshop [PPT]

An invited talk at the Third Workshop on Big Data Benchmarking [PPT]




  • BigDataBench 1.0 Beta Release
  • A presentation at ISCA-40's ASBD workshop
  • A tutorial at HPCA 2013 (2013-02-24)
  • A presentation at the Second Workshop on Big Data Benchmarking (2012-12-17)
  • An invited talk at the Third Workshop on Big Data Benchmarking