The preliminary BPOE program has been released.

Introduction

Big data has emerged as a strategic asset for nations and organizations, and there is a pressing need to generate value from it. However, the sheer volume of big data requires significant storage capacity, transmission bandwidth, computation, and power consumption. It is expected that systems of unprecedented scale will be needed to cope with the variety of big data and its daunting volume. Nevertheless, without big data benchmarks, it is very difficult for big data owners to decide which system best meets their specific requirements. They also face the challenge of optimizing these systems and their solutions for specific or even comprehensive workloads. Meanwhile, researchers are working on innovative data management systems, hardware architectures, operating systems, and programming systems to improve performance in dealing with big data.

This workshop, the ninth in its series, focuses on architecture and system support for big data systems, aiming to bring researchers and practitioners from the data management, architecture, and systems research communities together to discuss research issues at the intersection of these areas.

Call for Papers

Topics

The workshop seeks papers that address hot topics in benchmarking, designing, and optimizing big data systems. Specific topics of interest include, but are not limited to:

  • Low latency big data computing
  • Big scientific data
  • Big data workload characterization and benchmarking
  • Performance analysis of big data systems
  • Workload-optimized big data systems
  • Innovative prototypes of big data infrastructures
  • Emerging hardware technologies in big data systems
  • Operating systems support for big data systems
  • Interactions among architecture, systems and data management
  • Hardware and software co-design for big data
  • Practice report of evaluating and optimizing large-scale big data systems

Papers should present original research. As big data spans many disciplines, papers should provide sufficient background material to make them accessible to the broader community.

Download CFP

Paper Submissions

Papers must be submitted in PDF and be no more than 8 pages in the standard two-column SIGPLAN conference format, including figures and tables but excluding references. Shorter submissions are encouraged; submissions will be judged on the merit of the ideas rather than on length. Submissions must be made through the online submission site.

Submission site: https://easychair.org/conferences/?conf=bpoe9

The best paper award will be announced at the end of the workshop!

Important Dates

Papers due                       February 10, 2018 (extended from February 1, 2018)
Notification of acceptance       February 25, 2018 (extended from February 20, 2018)
Workshop Session                 March 25, 2018

Program

8:30-8:40

Opening remarks [PDF]

8:40-9:40

Keynote I: Architecting for Big Data Analytics: Think Dubai rather than Venice. [PDF]

Speaker: Prof. Vijay Janapa Reddi, The University of Texas at Austin

Abstract: Big data, specifically data analytics, is responsible for driving many of consumers' most common online activities, including shopping, web searches, and interactions on social media.  This talk focuses on two important aspects of big data analytics that affect both industry and academia, from a research as well as a development perspective.  First is the striking dearth of thread-level parallelism (TLP) in big data analytics applications, which partially undermines the scale-out push of modern data centers.  Second is the challenge of characterizing and analyzing these scale-out workloads in a manner that can enable fine-grained, transparent program introspection at the architecture level both within and across nodes. The talk first introduces the challenges on both these fronts.  It then highlights potential solutions, across both industry and academia, for addressing them.

Bio: Vijay Janapa Reddi is currently a Visiting Research Scientist at Google, on leave from his position as an Associate Professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research interests include computer architecture and software design to enhance mobile and high-performance computing systems, specifically focusing on always-on computing and end-user experience for mobile devices, and energy efficiency and reliability for heterogeneous system architectures. Dr. Janapa Reddi is a recipient of multiple awards, including the National Academy of Engineering (NAE) Gilbreth Lecturer Honor (2016), IEEE TCCA Young Computer Architect Award (2016), Intel Early Career Award (2013), Google Faculty Research Awards (2012, 2013, 2015, 2017), Best Paper at the 2005 International Symposium on Microarchitecture, Best Paper at the 2009 International Symposium on High Performance Computer Architecture, and IEEE’s Top Picks in Computer Architecture awards (2006, 2010, 2011, 2016, 2017). Beyond his technical research contributions, Dr. Janapa Reddi is passionate about STEM education. He is responsible for the Austin Independent School District’s “hands-on” computer science (HaCS) program, which teaches 6th- and 7th-grade students programming and the general principles that govern a computing system using open-source electronic prototyping platforms. He received a BS in computer engineering from Santa Clara University, an MS in electrical and computer engineering from the University of Colorado at Boulder, and a PhD in computer science from Harvard University.

9:40-10:10

Talk I: BigDataBench: A Dwarf-based Big Data and AI Benchmark Suite [PDF]

Speaker: Ms. Wanling Gao, ICT, CAS

Abstract: As the architecture, system, data management, and machine learning communities pay greater attention to innovative big data and data-driven artificial intelligence (in short, AI) algorithms, architectures, and systems, the pressure on benchmarking rises. However, the complexity, diversity, and frequent changes of workloads, together with the rapid evolution of big data and especially AI systems, raise great challenges for benchmarking. First, for the sake of conciseness, benchmarking scalability, portability, cost, reproducibility, and better interpretation of performance data, we need to understand the abstractions of frequently-appearing units of computation, which we call dwarfs, among big data and AI workloads. Second, for the sake of fairness, the benchmarks must include a diversity of data and workloads. Third, for co-design of software and hardware, the benchmarks should be consistent across different communities. Rather than creating a new benchmark or proxy for every possible workload, we propose using dwarf-based benchmarks, combinations of eight dwarfs, to represent the diversity of big data and AI workloads. The current version, BigDataBench 4.0, provides 13 representative real-world data sets and 47 big data and AI benchmarks, covering seven workload types: AI, online service, offline analytics, graph analytics, data warehouse, NoSQL, and streaming. BigDataBench 4.0 is publicly available from http://prof.ict.ac.cn/BigDataBench.

Bio: Wanling Gao is a Ph.D. candidate in computer science at the Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences. Her research interests focus on big data benchmarking and big data analytics. She received her B.S. degree in 2012 from Huazhong University of Science and Technology.

10:10-10:30

Tea Break

10:30-11:20

Talk II: Benchmarking and Characterizing Deep Learning over Big Data (DLoBD) Stacks

Speaker: Dr. Xiaoyi Lu, Ohio State University

Abstract: Deep Learning over Big Data (DLoBD) is becoming one of the most important research paradigms for mining value from the massive amounts of gathered data. Many emerging deep learning frameworks are starting to run over Big Data stacks such as Hadoop and Spark. With the convergence of HPC, Big Data, and Deep Learning, these DLoBD stacks are taking advantage of RDMA and multi-/many-core based CPUs/GPUs. Even though a lot of activity is happening in the field, there is a lack of systematic studies analyzing the impact of RDMA-capable networks and CPUs/GPUs on DLoBD stacks. To fill this gap, this talk will present a systematic characterization methodology and extensive performance evaluations of four representative DLoBD stacks (i.e., CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL) to expose interesting trends in performance, scalability, accuracy, and resource utilization. Our observations show that an RDMA-based design for DLoBD stacks can achieve up to 2.7x speedup compared to the IPoIB-based scheme. The RDMA scheme also scales better and utilizes resources more efficiently than the IPoIB scheme on InfiniBand clusters. For most cases, GPU-based deep learning can outperform CPU-based designs, but not always. Our studies show that there is substantial room to improve the designs of current-generation DLoBD stacks. Finally, we will present some in-depth case studies and benchmarking results for deep learning workloads on the Spark and TensorFlow frameworks.

Bio: Dr. Xiaoyi Lu is a Research Scientist in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, Big Data, Deep Learning, the Hadoop/Spark/Memcached/TensorFlow ecosystem, parallel computing models (MPI/PGAS), virtualization, and cloud computing. He has published over 90 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities (PC Co-Chair, PC Member, Invited Reviewer) for academic journals and conferences. Dr. Lu is currently leading the research and development of RDMA-based accelerations for Apache Hadoop, Spark, HBase, and Memcached, and the OSU HiBD micro-benchmarks, which are publicly available from http://hibd.cse.ohio-state.edu. These libraries are currently being used by more than 275 organizations from 34 countries, and more than 25,500 downloads have taken place from the project site. He is a core member of the MVAPICH2 (High-Performance MPI over InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE) project, where he leads the research and development of MVAPICH2-Virt (high-performance and scalable MPI for hypervisor- and container-based HPC clouds). He is a member of IEEE and ACM. More details about Dr. Lu are available at http://web.cse.ohio-state.edu/~lu.932.

11:20-11:50

Regular Paper I: Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences
[PDF]

Authors: Rajarshi Biswas, Xiaoyi Lu and Dhabaleswar Panda
(The Ohio State University)

Abstract: Remote procedure call (RPC) is the backbone of many modern distributed systems. Google’s gRPC is one of the most popular open-source RPC frameworks available in the community. gRPC is the main communication engine for Google’s deep learning framework TensorFlow, which primarily uses gRPC for communicating tensors and administrative tasks among different processes. Tensor updates during the training phase are communication-intensive, and thus TensorFlow’s performance is heavily dependent on the underlying network and the efficacy of the communication engine. Training deep learning models on TensorFlow can take significant time, ranging from several minutes to several hours, even several days. Thus, system researchers need to devote a lot of time to understanding the impact of communication on overall performance. Clearly, there is a lack of benchmarks available for system researchers. Therefore, we propose the TF-gRPC-Bench micro-benchmark suite, which enables system researchers to quickly understand the impact of the underlying network and communication runtime on deep learning workloads. To achieve this, we first analyze the characteristics of the TensorFlow workload over gRPC by training popular deep learning models. Then, we propose three micro-benchmarks that take these workload characteristics into account. Finally, we comprehensively evaluate gRPC with the TF-gRPC-Bench micro-benchmark suite on clusters equipped with different networking interconnects and present the results.
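
For readers who want a concrete sense of the kind of measurement such a suite performs, here is a minimal sketch of a gRPC point-to-point latency micro-benchmark in Python. It is not TF-gRPC-Bench itself: the service name (bench.Echo/Ping), payload sizes, and iteration counts are illustrative choices, and only gRPC's generic bytes-in/bytes-out API is used so that no .proto file is needed.

import time
from concurrent import futures

import grpc

# Hypothetical service/method path; any path works as long as the client
# and server agree on it.
_METHOD = '/bench.Echo/Ping'

def _echo(request, context):
    # With no deserializer configured, `request` arrives as raw bytes;
    # echoing it back measures a pure round trip.
    return request

def start_server(port=50051):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    handler = grpc.method_handlers_generic_handler(
        'bench.Echo',
        {'Ping': grpc.unary_unary_rpc_method_handler(_echo)})
    server.add_generic_rpc_handlers((handler,))
    server.add_insecure_port('[::]:%d' % port)
    server.start()
    return server

def benchmark(address='localhost:50051', iters=1000):
    channel = grpc.insecure_channel(address)
    # No serializers: payloads pass through as raw bytes.
    ping = channel.unary_unary(_METHOD)
    for size in (1 << 10, 1 << 16, 1 << 20):  # 1 KiB, 64 KiB, 1 MiB
        payload = b'x' * size
        ping(payload)  # warm-up call to establish the HTTP/2 stream
        start = time.perf_counter()
        for _ in range(iters):
            ping(payload)
        elapsed = time.perf_counter() - start
        print('%8d B payload: %.1f us per round trip'
              % (size, elapsed / iters * 1e6))

if __name__ == '__main__':
    server = start_server()
    benchmark()
    server.stop(0)

Running the same loop over different interconnects (e.g., 10 GbE vs. InfiniBand) isolates the network’s contribution to tensor-transfer latency, which is the kind of comparison the paper performs at larger scale and with workload-derived payload distributions.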

12:00-13:30

Lunch Break

13:30-14:30

Keynote II: The OS scheduler: a performance-critical component in Linux cluster environments
[PDF]

Speaker: Dr. Jean-Pierre Lozi, Oracle Labs Zurich

Abstract: Modern clusters rely on servers containing dozens of cores that have high infrastructure cost and energy consumption. Therefore, it is economically essential to exploit their full potential at both the application and the system level. While today's clusters often rely on Linux, recent research has shown that the Linux scheduler suffers from performance bugs that lead to core under-usage. Such bugs have gone unnoticed for many years. The consequences are wasted energy, poor infrastructure usage, and worse service response times. In this talk, we first present examples of critical performance bugs in the Linux scheduler that went unnoticed for years. We then present our analysis of how we ended up in such a situation, and discuss possible research directions to overcome these problems.

Bio: Jean-Pierre Lozi is a Principal Member of the Technical Staff at Oracle Labs Zurich, working on distributed graph processing (PGX.D). He was formerly an Associate Professor at the University of Nice Sophia Antipolis in France, where he was working on multicore architectures, lock and synchronization algorithms, and OS schedulers.

14:30-15:30

Keynote III: Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation
[PDF]

Speaker: Prof. Yiying Zhang, Purdue University

Abstract: Datacenters have been using a "monolithic" server model for decades, where each server hosts a set of hardware devices such as processors and memory chips and runs an OS or a hypervisor. This monolithic architecture is easy to deploy but cannot support heterogeneity, elasticity, resource utilization, or failure handling well. To meet the growing flexibility and heterogeneity demands of applications and hardware in today's datacenters, we have to rethink the decades-long server-centric model. I believe the answer is to break the server boundary when allocating and using hardware resources. My lab has been exploring various approaches of such "resource disaggregation" in two directions: physically disaggregating hardware devices and virtually using remote hardware resources. In this talk, I will discuss the challenges and potentials of datacenter resource disaggregation and introduce a set of efforts my lab has made in system software, hardware, and networking to build the next-generation resource-disaggregated datacenters.

Bio: Yiying Zhang is an assistant professor in the School of Electrical and Computer Engineering at Purdue University. Her research interests span operating systems, distributed systems, datacenter networking, and computer architecture, with a focus on building network, software, and hardware systems for next-generation datacenters. Her lab is pioneering the field of datacenter resource disaggregation and is among the few groups in the world that build new OSes and full-stack, cross-layer systems. She has published at and served on the program committees of top systems conferences such as SOSP, ASPLOS, and FAST, and her work has attracted attention from both industry and academia. Yiying received her Ph.D. from the Department of Computer Sciences at the University of Wisconsin-Madison and worked as a postdoctoral scholar at the University of California, San Diego before joining Purdue.

15:30-16:00

Tea Break

16:00-16:30

Talk III: BOPS, Not FLOPS! A New Metric, Measuring Tool, and Roofline Performance Model For Datacenter Computing
[PDF]

Speaker: Dr. Chen Zheng, ICT, CAS

Abstract: The past decades have witnessed FLOPS (Floating-point Operations per Second), an important computation-centric performance metric, guide computer architecture evolution, bridge hardware and software co-design, and provide quantitative performance numbers for system optimization. However, for emerging datacenter computing (in short, DC) workloads, such as internet services or big data analytics, previous work reports that on modern CPU architectures the average proportion of floating-point instructions is only 1% and the average FLOPS efficiency is only 0.1%, while the average CPU utilization is as high as 63%. These contradictory performance numbers imply that FLOPS is inappropriate for evaluating DC computer systems. To address this issue, we propose a new computation-centric metric, BOPS (Basic OPerations per Second). In our definition, basic operations include all arithmetic, logical, comparison, and array addressing operations on integer and floating-point data, and BOPS is the average number of BOPs (Basic OPerations) completed each second. We also present a dwarf-based measuring tool to evaluate DC computer systems in terms of our new metric, and on the basis of BOPS we propose a new roofline performance model for DC computing. Through experiments, we demonstrate that our new metric, measuring tool, and performance model indeed facilitate DC computer system design and optimization.
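
To make the roofline idea concrete: assuming the BOPS roofline follows the standard roofline form (the classical formula with FLOPS replaced by BOPS; the exact model presented in the talk may differ), attainable performance P is bounded as

    P = \min\left( P_{\text{peak}},\; B_{\text{mem}} \times I \right)

where P_{\text{peak}} is the machine's peak basic-operation throughput in BOPs/s, B_{\text{mem}} is the peak memory bandwidth in bytes/s, and I is the workload's operation intensity in BOPs per byte. Workloads with low I are memory-bandwidth bound; those with high I are bound by the machine's peak BOPS.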

Bio: Chen Zheng is a postdoctoral researcher at the Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences. His research focuses on operating systems, virtualization, benchmarking, and datacenter workload characterization. He received his Ph.D. in 2017 from the Institute of Computing Technology in China.

16:30-17:00

Regular Paper II: Using FPGAs as Microservices: Technology, Challenges and Case Study
[PDF]

Authors: David Ojika, Ann Gordon-Ross, Herman Lam, Bhavesh Patel, Gaurav Kaul, Jayson Strayer
(University of Florida, DELL EMC, Intel Corporation)

Abstract: Field-programmable gate arrays (FPGAs) have largely been used in communication and high-performance computing, and given the recent advances in big data, machine learning and emerging trends in cloud computing (e.g., serverless), FPGAs are increasingly being introduced into these domains (e.g., Microsoft’s datacenters and Amazon Web Services). To address these domains’ processing needs, recent research has focused on using FPGAs to accelerate workloads, ranging from analytics and machine learning to databases and network function virtualization. In this paper, we present a high-performance FPGA-as-a-microservice (FaaM) architecture for the cloud. We discuss some of the technical challenges and propose several solutions for efficiently integrating FPGAs into virtualized environments. Our case study, deploying multi-threaded, multi-user compression as a microservice using FaaM, indicates that microservices-based FPGA acceleration can sustain high performance compared to a straightforward CPU implementation, with minimal to no runtime overhead despite the hardware abstraction.

Venue Information

BPOE: Allegheny Room BC, Williamsburg Lodge.

Contact Information

Prof. Jianfeng Zhan:  zhanjianfeng@ict.ac.cn
Dr. Chen Zheng:       zhengchen@ict.ac.cn
Dr. Gang Lu:             lugang@mail.bafst.com

Organization

Steering committee:

  • Christos Kozyrakis, Stanford
  • Xiaofang Zhou, University of Queensland
  • Dhabaleswar K Panda, Ohio State University
  • Aoying Zhou, East China Normal University
  • Raghunath Nambiar, Cisco
  • Lizy K John, University of Texas at Austin
  • Xiaoyong Du, Renmin University of China
  • Ippokratis Pandis, IBM Almaden Research Center
  • Xueqi Cheng, ICT, Chinese Academy of Sciences
  • Bill Jia, Facebook
  • Lidong Zhou, Microsoft Research Asia
  • H. Peter Hofstee, IBM Austin Research Laboratory
  • Alexandros Labrinidis, University of Pittsburgh
  • Cheng-Zhong Xu, Wayne State University
  • Jianfeng Zhan, ICT, Chinese Academy of Sciences
  • Guang R. Gao, University of Delaware
  • Yunquan Zhang, ICT, Chinese Academy of Sciences

Program Co-Chairs: 

Prof. Jianfeng Zhan, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences
Dr. Chen Zheng, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences
Dr. Gang Lu, Beijing Academy of Frontier Science & Technology

Web and Publicity Chairs: 

Wanling Gao, ICT, CAS and UCAS     gaowanling@ict.ac.cn
Shaopeng Dai, ICT, CAS and UCAS   daishaopeng@ict.ac.cn
Zihan Jiang, ICT, CAS and UCAS      jiangzihan@ict.ac.cn


Program committee (Confirmed)

  • Lei Wang, Institute of Computing Technology, Chinese Academy of Sciences
  • Woongki Baek, Ulsan National Institute of Science and Technology (UNIST)
  • Trevor E. Carlson, National University of Singapore
  • Vijay Nagarajan, University of Edinburgh
  • Jack Sampson, Pennsylvania State University
  • Lucas Mello Schnorr, Federal University of Rio Grande do Sul (UFRGS)
  • Witawas Srisa-An, University of Nebraska at Lincoln
  • Zhen Jia, Princeton University
  • Xiaoyi Lu, The Ohio State University
  • Zhibin Yu, Shenzhen Institutes of Advanced Technology
  • Rong Zhang, East China Normal University

Previous Events

Workshop     Dates                 Location
BPOE-1       October 7, 2013       IEEE BigData Conference, San Jose, CA
BPOE-2       October 31, 2013      CCF HPC China, Guilin, China
BPOE-3       December 5, 2013      CCF Big Data Technology Conference 2013, Beijing, China
BPOE-4       March 1, 2014         ASPLOS 2014, Salt Lake City, Utah, USA
BPOE-5       September 5, 2014     VLDB 2014, Hangzhou, Zhejiang Province, China
BPOE-6       September 4, 2015     VLDB 2015, Hilton Waikoloa Village, Kohala Coast, Hawai‘i
BPOE-7       April 3, 2016         ASPLOS 2016, Atlanta, GA, USA
BPOE-8       April 9, 2017         ASPLOS 2017, Xi’an, China