Making Queries Tractable on Big Data with Preprocessing

A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. On big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective way to cope with this is to preprocess the data off-line, so that queries in the class can subsequently be evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of Π-tractable queries, denoted by ΠTQ0, to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are Π-tractable and are feasible on big data. (3) We also study a set ΠTQ of query classes that can be effectively converted to Π-tractable queries by re-factorizing their data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for ΠTQ. (5) We also show that ΠTQ0 ⊂ P unless P = NC, i.e., unless P = NC, the set ΠTQ0 of all Π-tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, ΠTQ = P, i.e., all PTIME query classes can be made Π-tractable via proper re-factorizations. This work is a step towards understanding the tractability of queries in the context of big data.
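
As an informal illustration of the preprocess-then-query pattern described above, the following minimal sketch sorts a dataset once off-line and then answers membership queries by binary search. The names `preprocess` and `answer` are hypothetical and chosen only for this example. The code is sequential, so it shows only the shape of the idea (a one-off PTIME step followed by logarithmic-time queries) and is not a parallel NC algorithm or a construction taken from the paper.

```python
from bisect import bisect_left


def preprocess(data):
    """Off-line step (hypothetical name): sort once, O(n log n), within PTIME."""
    return sorted(data)


def answer(index, key):
    """On-line step (hypothetical name): membership test by binary search.

    Uses O(log n) comparisons on the preprocessed (sorted) list.
    """
    pos = bisect_left(index, key)
    return pos < len(index) and index[pos] == key


if __name__ == "__main__":
    index = preprocess([42, 7, 19, 3, 88])  # done once, before any query
    print(answer(index, 19))  # True
    print(answer(index, 20))  # False
```

Without the preprocessing step each query would scan the whole list; with it, the per-query cost depends only logarithmically on the data size, which is the kind of gap the abstract is concerned with.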
