Parallel evaluation of conjunctive queries

The availability of large data centers with tens of thousands of servers has led to the popular adoption of massive parallelism for data analysis on large datasets. Several query languages exist for running queries on massively parallel architectures, some based on the MapReduce infrastructure, others using proprietary implementations. Motivated by this trend, this paper analyzes the parallel complexity of conjunctive queries. We propose a very simple model of parallel computation that captures these architectures, in which the complexity parameter is the number of parallel steps requiring synchronization of all servers. We study the complexity of conjunctive queries and give a complete characterization of the queries which can be computed in one parallel step. These form a strict subset of hierarchical queries, and include flat queries like R(x,y), S(x,z), T(x,v), U(x,w), tall queries like R(x), S(x,y), T(x,y,z), U(x,y,z,w), and combinations thereof, which we call tall-flat queries. We describe an algorithm for computing in parallel any tall-flat query, and prove that any query that is not tall-flat cannot be computed in one step in this model. Finally, we present extensions of our results to queries that are not tall-flat.

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  Jeffrey D. Ullman,et al.  Optimizing joins in a map-reduce environment , 2010, EDBT '10.

[3]  Ramesh Subramonian,et al.  LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.

[4]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[5]  David Maier,et al.  Dedalus: Datalog in Time and Space , 2010, Datalog.

[6]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[7]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[8]  Sara Cohen Containment of aggregate queries , 2005, SGMD.

[9]  Jan Van den Bussche,et al.  Database Query Processing Using Finite Cursor Machines , 2007, ICDT.

[10]  Liang Chen,et al.  Handling data skew in parallel joins in shared-nothing systems , 2008, SIGMOD Conference.

[11]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[12]  Joseph M. Hellerstein,et al.  The declarative imperative: experiences and conjectures in distributed logic , 2010, SGMD.

[13]  Shamkant B. Navathe,et al.  An Efficient Algorithm for Mining Association Rules in Large Databases , 1995, VLDB.

[14]  Christopher Olston,et al.  Building a HighLevel Dataflow System on top of MapReduce: The Pig Experience , 2009, Proc. VLDB Endow..

[15]  Larry Carter,et al.  Universal Classes of Hash Functions , 1979, J. Comput. Syst. Sci..

[16]  Neil Immerman,et al.  Expressibility and Parallel Complexity , 1989, SIAM J. Comput..

[17]  Martin Raab,et al.  "Balls into Bins" - A Simple and Tight Analysis , 1998, RANDOM.

[18]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[19]  Uzi Vishkin,et al.  Simulation of Parallel Random Access Machines by Circuits , 1984, SIAM J. Comput..

[20]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[21]  Sergei Vassilvitskii,et al.  A model of computation for MapReduce , 2010, SODA '10.

[22]  Leonid Libkin,et al.  Elements Of Finite Model Theory (Texts in Theoretical Computer Science. An Eatcs Series) , 2004 .

[23]  Anna Pagh,et al.  Uniform Hashing in Constant Time and Optimal Space , 2008, SIAM J. Comput..

[24]  Leonid Libkin,et al.  Elements of Finite Model Theory , 2004, Texts in Theoretical Computer Science.

[25]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.