News: User group on LinkedIn. The BigDataBench subset and simulator version for the architecture community have been released. Multimedia data sets and workloads will be available soon. A tutorial on BigDataBench will be held at MICRO 2014 (December 13-17, 2014, Cambridge, UK).
Summary
As a multi-discipline research effort spanning systems, architecture, and data management, BigDataBench is a big data benchmark suite (please refer to our summary paper and presentation at HPCA 2014). The current version is BigDataBench 3.0. It includes 6 real-world and 2 synthetic data sets, and 32 big data workloads, covering micro and application benchmarks from typical application domains, e.g., search engines, social networks, and e-commerce. To generate representative and diverse big data workloads, BigDataBench focuses on units of computation that frequently appear in Cloud "OLTP", OLAP, interactive, and offline analytics (please refer to our DASFAA paper). BigDataBench also provides several (parallel) big data generation tools (BDGS) to generate scalable big data, e.g., at PB scale, from small-scale real-world data while preserving its original characteristics. For example, on an 8-node cluster, BDGS generates 10 TB of Wiki data in 5 hours (more details can be found at Performance numbers of Big Data Generators). To model and reproduce multi-application or multi-user scenarios on large-scale datacenters, we also provide a multi-tenancy version of BigDataBench, which allows flexible setting and replaying of mixed big data workloads according to workload traces from real-world applications, e.g., Facebook, Google, and Sogou traces.
Different implementations are provided for the same workloads. Currently, we and other developers have implemented the offline analytics workloads using MapReduce, MPI, Spark, and DataMPI, and the interactive analytics and OLAP workloads using Shark, Impala, and Hive. For system and architecture research, i.e., architecture, OS, networking, and storage, the number of benchmarks is multiplied by the number of implementations and hence becomes massive. To reduce research cost, we select a small number of representative benchmarks, which we call a BigDataBench subset, from the full set of BigDataBench workloads according to workload characteristics from a specific perspective (please refer to the methodology in our IISWC 2014 paper). For example, since simulation-based research is very time-consuming, for the architecture community we select a small number of benchmarks from BigDataBench according to comprehensive micro-architectural characteristics, which we call the BigDataBench architecture subset, and provide a simulator version of BigDataBench for architecture research. Several case studies using BigDataBench have been reported.
Methodology
Figure 1 BigDataBench 3.0 methodology
Figure 1 summarizes the BigDataBench 3.0 methodology. Overall, it includes six steps: investigating typical application domains; understanding and choosing workloads and data sets; generating scalable data sets and workloads; providing different implementations; system characterization; and finalizing benchmarks. In the first step, we consider typical real-world data sets and representative big data workloads from important application domains, e.g., search engines, social networks, and e-commerce. In the second step, we observe and identify a full spectrum of units of computation that frequently appear in big data analytics. At the same time, we characterize data sets from the perspectives of data nature, data source, and data schema. In the third step, we develop the Big Data Generator Suite (BDGS for short) to generate synthetic big data, including text, graph, and table data. In the fourth step, since there are different lines of systems, different implementations are made available for the same workloads. The fifth step ensures our workloads are representative in terms of not only workload and data characteristics but also system characteristics. Finally, we decide on our big data benchmarks.
Benchmarks
BigDataBench 3.0 includes six real-world data sets, two synthetic data sets, and 32 big data workloads. Users are encouraged to choose workloads according to application type, for example, Cloud OLTP, offline analytics, OLAP, or interactive analytics workloads.
Table 1: The Summary of Data Sets
| # | Data Set | Data Size |
|---|----------|-----------|
| 1 | Wikipedia Entries | 4,300,000 English articles |
| 2 | Amazon Movie Reviews | 7,911,684 reviews |
| 3 | Google Web Graph | 875,713 nodes, 5,105,039 edges |
| 4 | Facebook Social Network | 4,039 nodes, 88,234 edges |
| 5 | E-commerce Transaction Data | Table 1: 4 columns, 38,658 rows; Table 2: 6 columns, 242,735 rows |
| 6 | ProfSearch Person Resumes | 278,956 resumes |
| 7 | CALDA Data (synthetic) | Table 1: 3 columns; Table 2: 9 columns |
| 8 | TPC-DS Web Data (synthetic) | 26 tables |
Table 2: The Summary of BigDataBench
| Application Types | Benchmark Types | Workloads | Data Sets | Software Stacks |
|---|---|---|---|---|
| Cloud OLTP | Micro Benchmarks | Read | ProfSearch Person Resumes: semi-structured table | HBase, MySQL |
| | | Write | | |
| | | Scan | | |
| | Application Benchmarks | Search Server | Wikipedia Entries: semi-structured text | HBase, Nutch |
| Offline Analytics | Micro Benchmarks | Sort | Wikipedia Entries | MPI, Spark, Hadoop |
| | | Grep | | |
| | | WordCount | | |
| | | BFS | Graph500 data set: unstructured graph | MPI |
| | Application Benchmarks | Index | Wikipedia Entries | MPI, Spark, Hadoop |
| | | PageRank | Google Web Graph: unstructured graph | |
| | | Kmeans | Google Web Graph and Facebook Social Network: unstructured graph | |
| | | Connected Components | | |
| | | Collaborative Filtering | Amazon Movie Reviews: semi-structured text | |
| | | Naive Bayes | | |
| OLAP and Interactive Analytics | Micro Benchmarks | Project | E-commerce Transaction Data, CALDA data, and TPC-DS Web data: structured table | MySQL, Hive, Shark, Impala |
| | | Filter | | |
| | | OrderBy | | |
| | | Cross Product | | |
| | | Union | | |
| | | Difference | | |
| | | Aggregation | | |
| | Application Benchmarks | Join Query | | |
| | | Select Query | | |
| | | Aggregation Query | | |
| | | Eight TPC-DS Web Queries | | |
History
BigDataBench 1.0 http://prof.ict.ac.cn/BigDataBench/old/1.0/
BigDataBench 2.0 http://prof.ict.ac.cn/BigDataBench/old/2.0/
Figure 2 Evolution of BigDataBench
Q&A
Please join the BigDataBench User group on LinkedIn to use and discuss BigDataBench.
1. Hadoop Version
Q1: I can't generate the input data for the Sort workload. The error message is: Caused by: java.io.FileNotFoundException: ToSeqFile.jar (No such file or directory)
A1: Put the sort-transfer file (ToSeqFile.jar) into your $HADOOP_HOME directory. The file can be found in the BigDataBench_V3.0_Hadoop_Hive package.
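The fix can be sketched in a few shell commands. The paths below are illustrative: temporary directories stand in for your real Hadoop home and the unpacked BigDataBench_V3.0_Hadoop_Hive package, so the sketch runs anywhere.

```shell
# Stand-in directories so the sketch is runnable; substitute your real paths.
HADOOP_HOME=$(mktemp -d)    # your real Hadoop installation directory
PKG=$(mktemp -d)            # where BigDataBench_V3.0_Hadoop_Hive is unpacked
touch "$PKG/ToSeqFile.jar"  # the jar ships with the package

# Copy the sort-transfer jar into $HADOOP_HOME so the Sort data
# generator can find it.
cp "$PKG/ToSeqFile.jar" "$HADOOP_HOME/"
test -f "$HADOOP_HOME/ToSeqFile.jar" && echo "ToSeqFile.jar installed"
```

On a real deployment, only the final `cp` is needed, with `$HADOOP_HOME` pointing at your Hadoop installation.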
Q2: When I run the Index workload, prepar.sh does not run correctly. The error message is: Error: number of words should be greater than 0
A2: Make sure the two word lists (linux.words and words) are placed under /usr/share/dict. They can be found in the BigDataBench_V3.0_Hadoop_Hive/SearchEngine/Index directory.
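As a sketch, the copy step looks like the following; a temporary directory stands in for /usr/share/dict (writing there normally requires root) and for the package directory, so adjust the paths to your nodes.

```shell
# Stand-ins so the sketch is runnable; substitute the real paths.
DICT_DIR=$(mktemp -d)                  # stands in for /usr/share/dict
PKG=$(mktemp -d)/SearchEngine/Index    # stands in for BigDataBench_V3.0_Hadoop_Hive/SearchEngine/Index
mkdir -p "$PKG"
touch "$PKG/linux.words" "$PKG/words"  # word lists shipped with the package

# The Index prepare step reads its dictionary from /usr/share/dict.
cp "$PKG/linux.words" "$PKG/words" "$DICT_DIR/"
ls "$DICT_DIR"
```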
2. Spark Version
Q1: When I run Spark workloads, they do not work correctly.
Running command:
./run-bigdatabench cn.ac.ict.bigdatabench.Sort $MASTER /sort-out /tmp/sort
Error message is:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
Or
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.SparkContext$.updatedConf(SparkContext.scala:1426)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at cn.ac.ict.bigdatabench.WordCount$.main(WordCount.scala:21)
at cn.ac.ict.bigdatabench.WordCount.main(WordCount.scala)
A1:
Please check the following:
1) The Hadoop version; the recommended version is Hadoop-1.0.2
2) The Scala and Spark versions; the recommended versions are
Spark-0.8.0-incubating-bin-hadoop1
Scala-2.9.3
3) Make sure the Hadoop, Spark, and Scala packages are deployed correctly
Our experimental platform:
Hadoop-1.0.2
Spark-0.8.0-incubating-bin-hadoop1
Scala-2.9.3
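The version checks in steps 1) and 2) can be scripted. In the sketch below the version strings are stubbed for illustration; in practice, parse them from the output of `hadoop version` and `scala -version`.

```shell
# Stubbed version strings; on a real system take them from
#   hadoop version | head -1      and      scala -version
HADOOP_VER="1.0.2"
SCALA_VER="2.9.3"

case "$HADOOP_VER" in
  1.0.2) echo "Hadoop: OK" ;;
  *)     echo "Hadoop: expected 1.0.2, found $HADOOP_VER" ;;
esac
case "$SCALA_VER" in
  2.9.3) echo "Scala: OK" ;;
  *)     echo "Scala: expected 2.9.3, found $SCALA_VER" ;;
esac
```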
Contacts (Email)
- Please join BigDataBench User group on Linkedin for Q&A.
- zhanjianfeng@ict.ac.cn
- wl@ncic.ac.cn
- lijingwei@ict.ac.cn
People
- Prof. Jianfeng Zhan
- Lei Wang
- Chunjie Luo
- Qiang Yang
- Yuqing Zhu
- Wanling Gao
- Rui Han
- Zhen Jia
- Jingwei Li
- Hainan Ye
- Shaopeng Dai
Alumni
- Yingjie Shi
- Zijian Ming
License
BigDataBench is available to researchers interested in big data. BigDataBench itself is open source under the Apache License, Version 2.0; please use all files in compliance with the License. The software components of BigDataBench are each available as open-source software governed by their own licensing terms, and researchers intending to use BigDataBench must fully understand and abide by the licensing terms of the various components.
Software developed externally (not by the BigDataBench group)
- Boost: http://www.boost.org/doc/libs/1_43_0/more/getting_started/unix-variants.html
- GCC: http://gcc.gnu.org/releases.html
- GSL: http://www.gnu.org/software/gsl/
- Graph500: http://www.graph500.org/referencecode
- Hadoop: http://www.apache.org/licenses/LICENSE-2.0
- HBase: http://hbase.apache.org/
- Hive: http://hive.apache.org/
- Impala: https://github.com/cloudera/impala
- MySQL: http://www.mysql.com/
- Mahout: http://www.apache.org/licenses/LICENSE-2.0
- MPICH: http://www.mpich.org/
- Nutch: http://www.apache.org/licenses/LICENSE-2.0
- Parallel Boost Graph Library: http://www.osl.iu.edu/research/pbgl/software/
- Spark: http://spark.incubator.apache.org/
- Shark: http://shark.cs.berkeley.edu/
- Scala: http://www.scala-lang.org/download/2.9.3.html
- ZooKeeper: http://zookeeper.apache.org/
Software developed internally (by the BigDataBench group)
BigDataBench_3.0 License
BigDataBench_3.0 Suite
Copyright (c) 2013-2015, ICT, Chinese Academy of Sciences. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must comply with the license and notice disclaimers.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimers in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ICT CHINESE ACADEMY OF SCIENCES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.