Algorithmic Aspects of Parallel Query Processing

Over the last decade, there has been growing interest in processing large data sets on large-scale distributed clusters. A large part of the complex data analysis pipelines run by these systems consists of sequences of relatively simple query operations, such as joining two or more tables or sorting. This tutorial discusses several recent algorithmic developments for data processing in such large distributed clusters. As its model of computation it uses the Massively Parallel Computation (MPC) model, a simplification of the BSP model in which the only costs are the amount of communication and the number of communication rounds. Based on the MPC model, we study and analyze several algorithms for three core data processing tasks: multiway join queries, sorting, and matrix multiplication. We discuss the algorithmic techniques common to all three tasks, relate the algorithms to those used in practical systems, and finally present open problems for future research.
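To make the MPC cost measure concrete, the following is a minimal sketch, not taken from the tutorial itself, of a single communication round that joins two binary relations R(a, b) and S(b, c) by hash-partitioning on the join attribute b. The function name `one_round_hash_join`, the tuple layout, and the use of Python's built-in `hash` as the partitioning function are all illustrative assumptions. The reported load is the largest number of tuples any one server receives in that round, which is the communication quantity the model charges for.

```python
from collections import defaultdict


def one_round_hash_join(R, S, p):
    """Simulate one MPC round joining R(a, b) with S(b, c) on p servers.

    Hypothetical helper for illustration only. Returns the join result
    and the maximum number of tuples received by any single server.
    """
    # Communication phase: every tuple is sent to the server chosen by
    # hashing its join attribute b (an assumed partitioning scheme).
    servers = [{"R": [], "S": []} for _ in range(p)]
    for a, b in R:
        servers[hash(b) % p]["R"].append((a, b))
    for b, c in S:
        servers[hash(b) % p]["S"].append((b, c))

    # The cost of the round is the largest per-server input.
    load = max(len(s["R"]) + len(s["S"]) for s in servers)

    # Local computation phase: each server joins its own fragments.
    # Local work is not charged in the MPC model.
    out = []
    for s in servers:
        index = defaultdict(list)
        for b, c in s["S"]:
            index[b].append(c)
        for a, b in s["R"]:
            for c in index.get(b, ()):
                out.append((a, b, c))
    return out, load


# Uniform join keys spread the input evenly over the servers.
R_uniform = [(i, i % 4) for i in range(8)]
S_small = [(j, j * 10) for j in range(4)]
result_uniform, load_uniform = one_round_hash_join(R_uniform, S_small, 4)

# A single heavy join key sends every R tuple to the same server.
R_skewed = [(i, 0) for i in range(8)]
result_skewed, load_skewed = one_round_hash_join(R_skewed, [(0, 0)], 4)
```

In this toy run the uniform input yields a much smaller maximum load than the skewed one, even though both use one round; this gap between balanced and skewed inputs is the kind of effect the load measure is meant to expose.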
