Overview

News: BigDataBench user group on LinkedIn. The BigDataBench subset and the simulator version for the architecture community have been released. Multimedia data sets and workloads will be available soon. A tutorial on BigDataBench at MICRO 2014 (December 13-17, 2014, Cambridge, UK).

Summary

As a multi-discipline research effort spanning systems, architecture, and data management, BigDataBench is a big data benchmark suite (please refer to our summary paper and presentation at HPCA 2014). The current version is BigDataBench 3.0. It includes six real-world data sets, two synthetic data sets, and 32 big data workloads, covering micro benchmarks and application benchmarks from typical application domains, e.g., search engines, social networks, and e-commerce. To generate representative and diverse big data workloads, BigDataBench focuses on units of computation that frequently appear in Cloud OLTP, OLAP, interactive analytics, and offline analytics (please refer to our DASFAA paper). BigDataBench also provides several parallel big data generation tools, collectively called BDGS, to generate scalable big data, e.g., at PB scale, from small-scale real-world data while preserving its original characteristics. For example, on an 8-node cluster, BDGS generates 10 TB of Wiki data in 5 hours (more details can be found at Performance numbers of Big Data Generators). To model and reproduce multi-application or multi-user scenarios in large-scale datacenters, we also provide a multi-tenancy version of BigDataBench, which allows flexible configuration and replay of mixed big data workloads according to workload traces from real-world applications, e.g., Facebook, Google, and Sogou traces.
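
To make the idea behind BDGS concrete, here is a toy Scala sketch of the general principle: learn a simple statistic from a small real-world seed data set and sample from it to emit synthetic data of any requested size. This is emphatically not the BDGS algorithm (the real generators use much richer models for text, graph, and table data); the program name, file path, and word count below are placeholders invented for illustration.

import scala.io.Source
import scala.util.Random

// Toy illustration of the BDGS idea, not the real generator: learn a simple
// characteristic of small seed data (its word-frequency distribution) and
// sample from it to emit arbitrarily large synthetic text.
object SeedScalingSketch {
  def main(args: Array[String]) {
    // args: <seed text file> <number of synthetic words to emit>
    val seedWords = Source.fromFile(args(0)).getLines()
                          .flatMap(_.split("\\s+")).filter(_.nonEmpty).toArray
    val target = args(1).toLong
    val rnd = new Random(42)

    // Uniform sampling from the seed word list reproduces the seed's empirical
    // word-frequency distribution in expectation, so the synthetic text keeps
    // that (very simple) characteristic while growing to the requested size.
    var emitted = 0L
    while (emitted < target) {
      print(seedWords(rnd.nextInt(seedWords.length)))
      print(if ((emitted + 1) % 20 == 0) "\n" else " ")
      emitted += 1
    }
  }
}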

For the same workloads, different implementations are provided. Currently, we and other developers have implemented the offline analytics workloads using MapReduce, MPI, Spark, and DataMPI, and the interactive analytics and OLAP workloads using Shark, Impala, and Hive. For system and architecture research, i.e., architecture, OS, networking, and storage, the number of benchmarks is multiplied by the number of implementations and hence becomes very large. To reduce the research cost, we select a small number of representative benchmarks, which we call a BigDataBench subset, from the full set of BigDataBench workloads according to workload characteristics from a specific perspective (please refer to the methodology in our IISWC 2014 paper). For example, since simulation-based research is very time-consuming, we select for the architecture community a small number of benchmarks from BigDataBench according to comprehensive micro-architectural characteristics, which we call the BigDataBench architecture subset, and provide a simulator version of BigDataBench for architecture research. Several case studies using BigDataBench have been reported.
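
To give a flavor of what one such implementation looks like, below is a minimal RDD-based WordCount sketch in Scala for the Spark stack, written against the Spark 0.8-era API referenced elsewhere on this page. The class name, argument order, and paths are illustrative assumptions; this is not the shipped cn.ac.ict.bigdatabench.WordCount code.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey

// Minimal WordCount in the spirit of the BigDataBench Spark workloads.
object WordCountSketch {
  def main(args: Array[String]) {
    // args: <spark master URL> <input path> <output path>; placeholders, not shipped defaults
    val sc = new SparkContext(args(0), "WordCountSketch")
    sc.textFile(args(1))                     // read the input text, e.g. Wikipedia Entries
      .flatMap(line => line.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))                // count each occurrence once
      .reduceByKey(_ + _)                    // sum the counts per word
      .saveAsTextFile(args(2))               // write (word, count) pairs back out
    sc.stop()
  }
}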

Methodology


Figure 1 BigDataBench 3.0 methodology

Figure 1 summarizes the BigDataBench 3.0 methodology. Overall, it includes six steps: investigating typical application domains; understanding and choosing workloads and data sets; generating scalable data sets and workloads; providing different implementations; system characterization; and finalizing the benchmarks. In the first step, we consider typical real-world data sets and representative big data workloads from important application domains, e.g., search engines, social networks, and e-commerce. In the second step, we observe and identify a full spectrum of units of computation that frequently appear in big data analytics, and at the same time we characterize the data sets in terms of the nature of the data, data sources, and data schemas. In the third step, we develop the Big Data Generator Suite (in short, BDGS) to generate synthetic big data, including text, graph, and table data. In the fourth step, since there are different lines of systems, different implementations of the same workloads are provided. The fifth step ensures that our workloads are representative not only in terms of workload and data characteristics but also in terms of system characteristics. Finally, we finalize the big data benchmarks.

Benchmarks

BigDataBench 3.0 includes six real-world data sets, two synthetic data sets, and 32 big data workloads. Users are encouraged to choose workloads according to the application types of interest, for example, Cloud OLTP, offline analytics, OLAP, and interactive analytics workloads.

Table 1: The Summary of Data Sets

No. | Data Set | Data Size
1 | Wikipedia Entries | 4,300,000 English articles
2 | Amazon Movie Reviews | 7,911,684 reviews
3 | Google Web Graph | 875,713 nodes, 5,105,039 edges
4 | Facebook Social Network | 4,039 nodes, 88,234 edges
5 | E-commerce Transaction Data | Table 1: 4 columns, 38,658 rows; Table 2: 6 columns, 242,735 rows
6 | ProfSearch Person Resumes | 278,956 resumes
7 | CALDA Data (synthetic) | Table 1: 3 columns; Table 2: 9 columns
8 | TPC-DS Web Data (synthetic) | 26 tables

Table 2: The Summary of BigDataBench

Application Type | Benchmark Type | Workloads | Data Sets | Software Stacks
Cloud OLTP | Micro Benchmarks | Read, Write, Scan | ProfSearch Person Resumes (semi-structured table) | HBase, MySQL
Cloud OLTP | Application Benchmarks | Search Server | Wikipedia Entries (semi-structured text) | HBase, Nutch
Offline Analytics | Micro Benchmarks | Sort, Grep, WordCount | Wikipedia Entries | MPI, Spark, Hadoop
Offline Analytics | Micro Benchmarks | BFS | Graph500 data set (unstructured graph) | MPI
Offline Analytics | Application Benchmarks | Index | Wikipedia Entries | MPI, Spark, Hadoop
Offline Analytics | Application Benchmarks | PageRank | Google Web Graph (unstructured graph) | MPI, Spark, Hadoop
Offline Analytics | Application Benchmarks | Kmeans, Connected Components | Google Web Graph and Facebook Social Network (unstructured graphs) | MPI, Spark, Hadoop
Offline Analytics | Application Benchmarks | Collaborative Filtering, Naive Bayes | Amazon Movie Reviews (semi-structured text) | MPI, Spark, Hadoop
OLAP and Interactive Analytics | Micro Benchmarks | Project, Filter, OrderBy, Cross Product, Union, Difference, Aggregation | E-commerce Transaction Data, CALDA Data, and TPC-DS Web Data (structured tables) | MySQL, Hive, Shark, Impala
OLAP and Interactive Analytics | Application Benchmarks | Join Query, Select Query, Aggregation Query, Eight TPC-DS Web Queries | E-commerce Transaction Data, CALDA Data, and TPC-DS Web Data (structured tables) | MySQL, Hive, Shark, Impala
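
To illustrate what the OLAP and interactive analytics micro benchmarks above compute, the sketch below expresses Filter and Aggregation over a structured table using plain Spark RDD operations in Scala. In BigDataBench these operations are run as SQL-style queries on MySQL, Hive, Shark, or Impala; the four-column table layout and the threshold predicate here are invented purely for illustration.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey

// Illustration only: the Filter and Aggregation micro benchmarks expressed as RDD operations.
object RelationalSketch {
  def main(args: Array[String]) {
    // args: <spark master URL> <input table path> <output path>; placeholders only
    val sc = new SparkContext(args(0), "RelationalSketch")

    // Assume a comma-separated table with columns (order_id, buyer_id, goods_id, amount);
    // this layout is made up for the example and is not the real e-commerce schema.
    val rows = sc.textFile(args(1)).map(_.split(","))

    // Filter: keep only the rows whose amount exceeds a threshold.
    val large = rows.filter(r => r(3).toDouble > 100.0)

    // Aggregation: total amount per buyer (GROUP BY buyer_id, SUM(amount)).
    val totalPerBuyer = large.map(r => (r(1), r(3).toDouble)).reduceByKey(_ + _)

    totalPerBuyer.saveAsTextFile(args(2))
    sc.stop()
  }
}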

History

BigDataBench 1.0 http://prof.ict.ac.cn/BigDataBench/old/1.0/

BigDataBench 2.0 http://prof.ict.ac.cn/BigDataBench/old/2.0/

Figure 2 Evolution of BigDataBench

Q & A

Please join the BigDataBench user group on LinkedIn to discuss and get help with using BigDataBench.

1. Hadoop Version

Q1: I can't generate the input data for the Sort workload. The error message is: Caused by: java.io.FileNotFoundException: ToSeqFile.jar (No such file or directory)

A1: Put the sort-transfer file (ToSeqFile.jar) into your $Hadoop_Home directory. The sort-transfer file can be found in the BigDataBench_V3.0_Hadoop_Hive package.

Q2: When I run the Index workload, prepar.sh does not run correctly. The error message is: Error: number of words should be greater than 0

A2: Make sure that the two word-list files (linux.words and words) are placed under /usr/share/dict. These files can be found in the BigDataBench_V3.0_Hadoop_Hive/SearchEngine/Index directory.

2. Spark Version

Q1: When I run the Spark workloads, they do not work correctly.
Running command:
./run-bigdatabench cn.ac.ict.bigdatabench.Sort $MASTER /sort-out /tmp/sort
The error message is:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
Or
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.SparkContext$.updatedConf(SparkContext.scala:1426)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at cn.ac.ict.bigdatabench.WordCount$.main(WordCount.scala:21)
at cn.ac.ict.bigdatabench.WordCount.main(WordCount.scala)

A1:
Please check the following:
1) Check the version of Hadoop; the recommended version is Hadoop-1.0.2.
2) Check the versions of Scala and Spark; the recommended versions are Spark-0.8.0-incubating-bin-hadoop1 and Scala-2.9.3.
3) Make sure that the Hadoop, Spark and Scala packages are deployed correctly.
Our experimental platform:
Hadoop-1.0.2
Spark-0.8.0-incubating-bin-hadoop1
Scala-2.9.3

Contacts (Email)

People

  • Prof. Jianfeng Zhan
  • Lei Wang
  • Chunjie Luo
  • Qiang Yang
  • Yuqing Zhu
  • Wanling Gao
  • Rui Han
  • Zhen Jia
  • Jingwei Li
  • Hainan Ye
  • Shaopeng Dai

Alumni

  • Yingjie Shi
  • Zijian Ming

License

BigDataBench is available to researchers interested in big data. The software components of BigDataBench are all available as open-source software and are governed by their own licensing terms; researchers intending to use BigDataBench are required to fully understand and abide by the licensing terms of the various components. BigDataBench itself is open-source under the Apache License, Version 2.0; please use all files in compliance with the License.

Software developed externally (not by the BigDataBench group)

Software developed internally (by the BigDataBench group)

BigDataBench_3.0 License

BigDataBench_3.0 Suite
Copyright (c) 2013-2015, ICT, Chinese Academy of Sciences. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must comply with the license and notice disclaimers.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimers in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.