BigDataBench 5.0 is released.

BigDataBench 5.0 User Manual [BigDataBench-UserManual]

BigDataBench JStorm User Manual [BigDataBench-JStorm-UserManual]

BigDataBench Spark Streaming User Manual [BigDataBench-SparkStreaming-UserManual]

Table 1: The Summary of Data Sets

No. | Data set | Data size | Scalable data set
1 | Wikipedia Entries | 4,300,000 English articles (unstructured text) | Text Generator of BDGS
2 | Amazon Movie Reviews | 7,911,684 reviews (semi-structured text) | Text Generator of BDGS
3 | Google Web Graph | 875713 nodes, 5105039 edges (unstructured graph) | Graph Generator of BDGS
4 | Facebook Social Network | 4039 nodes, 88234 edges (unstructured graph) | Graph Generator of BDGS
5 | E-commerce Transaction Data | table1: 4 columns, 38658 rows; table2: 6 columns, 242735 rows (structured table) | Table Generator of BDGS
6 | ProfSearch Person Resumes | 278956 resumes (semi-structured table) | Table Generator of BDGS
7 | CIFAR-10 | 60000 color images with the dimension of 32*32 | Ongoing development
8 | ImageNet | (1GB, 10GB) ILSVRC2014 DET image dataset (unstructured image) | Ongoing development
9 | LSUN | One million labelled images, classified into 10 scene categories and 20 object categories | Ongoing development
10 | TED Talks | Translated TED talks provided by IWSLT evaluation campaign | Ongoing development
11 | SoGou Data (Search Data processed from SogouT) | the corpus and search query data from So-Gou Labs (unstructured text) | Ongoing development
12 | MNIST | handwritten digits database which has 60,000 training examples and 10,000 test examples (unstructured image) | Ongoing development
13 | MovieLens Dataset | User’s score data for movies, which has 9,518,231 training examples and 386,835 test examples (semi-structured text) | Ongoing development
14 | WMT English-German | WMT English-German translation dataset | Ongoing development
15 | MS COCO2014 | A large-scale object detection, segmentation, and captioning dataset | Ongoing development
16 | Cityscapes | A new large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities | Ongoing development
17 | LibriSpeech | A corpus of approximately 1000 hours of 16kHz read English speech | Ongoing development
18 | VGGFace2 | A large-scale face recognition dataset | Ongoing development

We provide two download options: the full software package in a single download, or individual components one at a time. Note that you must download and deploy the prerequisite software packages before using BigDataBench; refer to the user manual for details. The following packages should be installed first. The target platform is Linux.

Software | Version | Download
TensorFlow | 1.12 |
PyTorch | 1.0.1 |
Hadoop | 1.0.2 |
HBase | 0.94.5 |
Cassandra | 1.2.3 |
MongoDB | 2.4.1 |
Mahout | 0.8 |
Hive | 0.9.0 | #GettingStarted-InstallationandConfiguration
Spark | 0.8.0 |
Shark | 0.8.0 |
Impala | 1.1.1 |
Flink | 0.10.1 |
Boost | 1_43_0 |
Scala | 2.9.3 |
GCC | 4.8.2 |
GSL | 1.16 |
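
Before deploying BigDataBench, it can help to confirm that the prerequisites are visible on the machine. The following is a minimal sketch, not part of BigDataBench itself: it only checks whether an executable for each package can be found on PATH. The command names in PREREQ_COMMANDS (hadoop, hbase, hive, scala, gcc) and the helper find_missing are assumptions made for illustration. The script does not verify versions, and it covers only a few of the listed packages.

```python
import shutil

# Hypothetical mapping from prerequisite name to the executable expected
# on PATH. These command names are assumptions, not taken from the manual.
PREREQ_COMMANDS = {
    "Hadoop": "hadoop",
    "HBase": "hbase",
    "Hive": "hive",
    "Scala": "scala",
    "GCC": "gcc",
}


def find_missing(prereqs, which=shutil.which):
    """Return the names of prerequisites whose executable is not on PATH."""
    return [name for name, cmd in prereqs.items() if which(cmd) is None]


if __name__ == "__main__":
    missing = find_missing(PREREQ_COMMANDS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All checked prerequisites were found on PATH.")
```

Libraries such as Boost and GSL have no single executable to look for, so they are left out here and would need a different check, for example through the system package manager.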

Full software packages of different implementations are available from the following links.

Micro Benchmark:

Component Benchmark:

Application Benchmark:

Proxy Benchmark for Micro-architectural Simulation:

BDGS: Big Data Generator Suite in BigDataBench

Name | Description
BDGS | Generates big data on the basis of six raw data sets
Text BigDataGeneratorSuite.tar.gz (size: 40MB)