Big data multi-query optimisation with Apache Flink

Big data analytic frameworks such as MapReduce, Spark and Flink have become increasingly popular for processing large datasets. Flink is an open-source, Apache-hosted big data analytic framework for processing batch and streaming data. For historical (batch) data processing, Flink's query optimiser builds on techniques used in parallel database systems. The optimiser translates queries into jobs, and jobs are often submitted repeatedly with similar tasks, so exploiting the similarity between tasks can avoid redundant computation. This paper proposes Flink-MQO, a multi-query optimisation system built on top of the Flink software stack. It acts as an add-on to Apache Flink that optimises multiple queries through data sharing. Flink-MQO exploits the data sharing opportunities of selection operators to eliminate redundant and duplicated in-network data movement across queries. Experimental results show that exploiting shared selection operators across big data queries yields promising query execution times. Flink-MQO could therefore potentially be used in stream processing to improve the performance of real-time applications.
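The idea of sharing work between the selection operators of several queries can be illustrated with a minimal sketch. It is written in plain Python rather than against the Flink API, and it is not the Flink-MQO implementation: the function names, the row layout and the example predicates are all assumptions made for illustration. The baseline scans the input once per query, while the shared variant reads each row once and evaluates every query's selection predicate on it.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical row and predicate types, chosen only for this illustration.
Row = Dict[str, int]
Predicate = Callable[[Row], bool]


def run_separately(
    rows: List[Row], queries: Dict[str, Predicate]
) -> Tuple[Dict[str, List[Row]], int]:
    """Baseline: every query performs its own full scan of the input."""
    results: Dict[str, List[Row]] = {}
    rows_read = 0
    for name, predicate in queries.items():
        selected = []
        for row in rows:
            rows_read += 1  # the same row is read again for each query
            if predicate(row):
                selected.append(row)
        results[name] = selected
    return results, rows_read


def run_shared(
    rows: List[Row], queries: Dict[str, Predicate]
) -> Tuple[Dict[str, List[Row]], int]:
    """Shared scan: each row is read once and routed to every query it satisfies."""
    results: Dict[str, List[Row]] = {name: [] for name in queries}
    rows_read = 0
    for row in rows:
        rows_read += 1  # a single read serves all queries
        for name, predicate in queries.items():
            if predicate(row):
                results[name].append(row)
    return results, rows_read


# Example input and queries (assumed, not taken from the paper).
example_rows: List[Row] = [{"id": i, "price": i * 10} for i in range(6)]
example_queries: Dict[str, Predicate] = {
    "q1": lambda r: r["price"] > 20,
    "q2": lambda r: r["price"] > 20 and r["id"] % 2 == 0,
}

separate_results, separate_reads = run_separately(example_rows, example_queries)
shared_results, shared_reads = run_shared(example_rows, example_queries)
```

Both functions return the same per-query results. Only the number of row reads differs, and it grows with the number of queries in the baseline but not in the shared variant. In a distributed setting the analogous saving would be in data that has to be moved between operators, which is the redundancy the abstract refers to.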
