Road to Freedom in Big Data Analytics

The world is fast moving towards a data-driven society in which data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In doing so, they face many challenges; chiefly, platform dependence, poor interoperability, and poor performance when using multiple platforms. We present RHEEM, our vision for big data analytics over diverse data processing platforms. RHEEM provides a three-layer data processing and storage abstraction to achieve both platform independence and interoperability across multiple platforms. In this paper, we discuss our vision and present the research challenges that we need to address to achieve it. As a case in point, we present a data cleaning application built using some of the ideas of RHEEM, and show how it achieves platform independence as well as the performance benefits of following such an approach.

1. WHY TIED TO A SINGLE SYSTEM?

Data analytic tasks range from very simple queries to extremely complex pipelines, such as data extraction, transformation, and loading (ETL), online analytical processing (OLAP), graph processing, and machine learning (ML). Following the dictum "one size does not fit all" [23], academia and industry have embarked on an endless race to develop data processing platforms supporting these different tasks, e.g., DBMSs and MapReduce-like systems. Semantic completeness, high performance, and scalability are key objectives of such platforms. While there have been major achievements towards these objectives, users still face two main roadblocks.

The first roadblock is that applications are tied to a single processing platform, making the migration of an application to new and more efficient platforms a difficult and costly task. Furthermore, complex analytic tasks usually require the combined use of different processing platforms. As a result, the common practice is to develop several specialized analytic applications on top of different platforms.
This requires users to manually combine the results to draw a conclusion. In addition, users may need to re-implement existing applications on top of faster processing platforms as these become available. For example, Spark SQL [3] and MLlib [2] are the Spark counterparts of Hive [24] and Mahout [1].

The second roadblock is that datasets are often produced by different sources and hence natively reside on different storage platforms. As a result, users often perform tedious, time-intensive, and costly data migration and integration tasks before further analysis.

Let us illustrate these roadblocks with an Oil & Gas industry example [13]. A single oil company can produce more than 1.5TB of diverse data per day [6]. Such data may be structured or unstructured and come from heterogeneous sources, such as sensors, GPS devices, and other measuring instruments. For instance, during the exploration phase, data has to be acquired, integrated, and analyzed in order to predict if a reservoir would be profitable. Thousands of downhole sensors in exploratory wells produce real-time seismic data for monitoring resources and environmental conditions. Users integrate these data with the physical properties of the rocks to visualize volume and surface renderings. From these visualizations, geologists and geophysicists formulate hypotheses and verify them with ML methods, such as regression and classification. Training of the models is performed with historical drilling and production data, but oftentimes users have to go over unstructured data, such as notes exchanged by email or text from drilling reports filed in a cabinet.

∗Work done while at QCRI. ©2016, Copyright is with the authors. Published in Proc. 19th International Conference on Extending Database Technology (EDBT), March 15-18, 2016, Bordeaux, France: ISBN 978-3-89318-070-7, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
Thus, an application supporting such a complex analytic pipeline has to access several sources of historical data (relational, but also text and semi-structured), remove the noise from the streaming data coming from the sensors, and run both traditional (such as SQL) and statistical analytics (such as ML algorithms) over different processing platforms. Similar examples can be drawn from many other domains, such as healthcare: e.g., IBM reported that North York hospital needs to process 50 diverse datasets that reside on a dozen different internal systems [15]. These emerging applications clearly show the need for complex analytics coupled with a diversity of processing platforms, which raises two major research challenges.

Visionary Paper Series. ISSN: 2367-2005. DOI: 10.5441/002/edbt.2016.45.

Data Processing Challenge. Users are faced with various choices of where to process their data, each choice with possibly orders-of-magnitude differences in performance. However, users have to be intimate with the intricacies of the processing platform to achieve high efficiency and scalability. Moreover, once a decision is taken, users may end up tied to a particular platform. As a result, migrating the data analytics stack to a more efficient processing platform often becomes a nightmare. Thus, there is a need to build a system that offers data processing platform independence. Furthermore, complex analytic applications require executing tasks over different processing platforms to achieve high performance. For example, one may aggregate large datasets with traditional queries on top of a relational database such as PostgreSQL, while ML tasks might be much faster if executed on Spark [28]. However, this requires a considerable amount of manual work in selecting the best processing platforms, optimizing tasks for the chosen platforms, and coordinating task execution. Thus, this also calls for multi-platform task execution.

Data Storage Challenge.
Data processing platforms are typically tightly coupled with a specific storage solution. Moving data from a given store (e.g., a relational database) to a processing platform better suited for the actual task (e.g., Spark on HDFS) requires shuffling data between different systems. Such shuffling may end up dominating the execution time. Moreover, different departments in the same organization may opt for different storage engines for legacy as well as performance reasons. Dealing with such heterogeneity calls for data storage independence.

To tackle these two challenges, we envision a system, called RHEEM, that provides both platform independence and interoperability (Section 2). In the following, we first discuss our vision for the data processing abstraction (Section 3), which is fully based on user-defined functions (UDFs) to provide adaptability as well as extensibility. This processing abstraction allows both users to focus only on the logic of their data analytic tasks and applications to be independent from the data processing platforms. We then discuss how to divide a complex analytic task into smaller subtasks to exploit the availability of different processing platforms (Section 4). As a result, RHEEM can run a single data analytic task over multiple processing platforms simultaneously to boost performance. Next, we present our first attempt to build an application based on some of the ideas of RHEEM and the resulting benefits (Section 5). We then show how we push down the processing abstraction idea to the storage layer (Section 6). This storage abstraction allows both users to focus on their storage needs and the processing platforms to be independent from the storage engines. Some initial efforts also go in the direction of providing data processing platform independence [11, 12, 21] (Section 7). However, our vision goes beyond data processing.
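To give an intuition for such a UDF-based processing abstraction, consider the following minimal sketch. It is purely illustrative and is not RHEEM's actual API: the operator classes, the `execute` signature, and the platform names are all invented here. The key idea it shows is that a logical operator carries only the user's logic, while the binding to a processing platform is a separate, late decision, so operators of one plan can run on different platforms.

```python
# Illustrative sketch of a UDF-based processing abstraction (not RHEEM's API).
# Logical operators carry only user logic; each operator in a plan is bound
# to a platform at execution time.

class MapOperator:
    def __init__(self, udf):
        self.udf = udf                      # user-defined transformation
    def execute(self, data, platform):
        # A real system would translate this operator to Spark, a DBMS, etc.;
        # here every "platform" is simulated in-process.
        return [self.udf(x) for x in data]

class FilterOperator:
    def __init__(self, udf):
        self.udf = udf                      # user-defined predicate
    def execute(self, data, platform):
        return [x for x in data if self.udf(x)]

def run_plan(plan, data):
    """Execute a list of (operator, platform) pairs, possibly mixing platforms."""
    for op, platform in plan:
        data = op.execute(data, platform)
    return data

plan = [(FilterOperator(lambda x: x % 2 == 0), "postgres"),  # relational scan
        (MapOperator(lambda x: x * x), "spark")]             # parallel map
run_plan(plan, range(10))  # → [0, 4, 16, 36, 64]
```

Note that the user never mentions a platform when writing the UDFs; the pairing of operators with platforms could equally be produced by an optimizer rather than by hand.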
We not only envision a data processing abstraction but also a data storage abstraction, allowing us to consider data movement costs during task optimization. In Section 8, we give a research agenda highlighting the challenges that need to be tackled to build RHEEM.
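As a back-of-the-envelope illustration of why data movement must enter the optimizer's cost model, the sketch below enumerates platform assignments for a two-operator pipeline and charges a transfer penalty at every platform switch. All operator names, platform names, and cost figures are made up for illustration and do not reflect RHEEM's actual cost model:

```python
# Illustrative-only cost model: per-operator execution costs plus a
# data-movement charge whenever consecutive operators run on different
# platforms. None of these figures come from RHEEM.
from itertools import product

EXEC_COST = {
    ("filter", "postgres"): 1.0,  ("filter", "spark"): 10.0,
    ("train",  "postgres"): 40.0, ("train",  "spark"): 5.0,
}
MOVE_COST = {("postgres", "spark"): 6.0, ("spark", "postgres"): 6.0}

def plan_cost(ops, assignment):
    """Total cost of running `ops` with one platform per operator,
    including a transfer charge at every platform switch."""
    total = sum(EXEC_COST[(op, p)] for op, p in zip(ops, assignment))
    for prev, nxt in zip(assignment, assignment[1:]):
        if prev != nxt:
            total += MOVE_COST[(prev, nxt)]   # shuffling data between systems
    return total

def best_plan(ops, platforms=("postgres", "spark")):
    """Exhaustively pick the cheapest platform assignment."""
    return min(product(platforms, repeat=len(ops)),
               key=lambda a: plan_cost(ops, a))

best_plan(["filter", "train"])  # → ('postgres', 'spark')
```

Even with the transfer penalty, the mixed assignment (filter on the database, training on Spark) beats any single-platform plan here; with a larger penalty, the optimizer would keep the whole pipeline on one platform instead. This trade-off is exactly what a storage-aware optimizer must weigh.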

REFERENCES

[1] Nicolas Bruno et al. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal, 2012.
[2] Scott Shenker et al. Spark: Cluster Computing with Working Sets. HotCloud, 2010.
[3] Luc Quoniam et al. How to Use Big Data Technologies to Optimize Operations in Upstream Petroleum Industry. ArXiv, 2013.
[4] Chen Li et al. AsterixDB: A Scalable, Open Source BDMS. PVLDB, 2014.
[5] Dominic Battré et al. Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. SoCC, 2010.
[6] Jorge-Arnulfo Quiané-Ruiz et al. WWHow! Freeing Data Storage from Cages. CIDR, 2013.
[7] Paolo Papotti et al. BigDansing: A System for Big Data Cleansing. SIGMOD, 2015.
[8] Rares Vernica et al. Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing. ICDE, 2011.
[9] Christopher Ré et al. Automatic Optimization for MapReduce Programs. PVLDB, 2011.
[10] Samuel Madden et al. CARTILAGE: Adding Flexibility to the Hadoop Skeleton. SIGMOD, 2013.
[11] Volker Markl et al. Peeking into the Optimization of Data Flow Programs with MapReduce-style UDFs. ICDE, 2013.
[12] Scott Shenker et al. Shark: SQL and Rich Analytics at Scale. SIGMOD, 2013.
[13] Paolo Papotti et al. Lightning Fast and Space Efficient Inequality Joins. PVLDB, 2015.
[14] Pete Wyckoff et al. Hive: A Warehousing Solution over a Map-Reduce Framework. PVLDB, 2009.
[15] Felix Naumann et al. SOFA: An Extensible Logical Optimizer for UDF-heavy Data Flows. Information Systems, 2015.
[16] Michael Stonebraker et al. "One Size Fits All": An Idea Whose Time Has Come and Gone. ICDE, 2005.
[17] Steven Hand et al. Musketeer: All for One, One for All in Data Processing Systems. EuroSys, 2015.
[18] Michael Stonebraker et al. A Demonstration of the BigDAWG Polystore System. PVLDB, 2015.
[19] Peter J. Haas et al. Simulation of Database-Valued Markov Chains Using SimSQL. SIGMOD, 2013.
[20] Herodotos Herodotou et al. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 2011.
[21] Shivnath Babu et al. How to Fit when No One Size Fits. CIDR, 2013.
[22] Ioana Manolescu et al. Invisible Glue: Scalable Self-Tuning Multi-Stores. CIDR, 2015.
[23] Michael Isard et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. OSDI, 2008.