Paper Information - Making Queries Tractable on Big Data with Preprocessing

A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. On big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective way to cope with this is to preprocess the data offline, so that queries in the class can subsequently be evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of Π-tractable queries, denoted by ΠTQ0, to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are Π-tractable and are feasible on big data. (3) We also study a set ΠTQ of query classes that can be effectively converted to Π-tractable queries by re-factorizing their data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for ΠTQ. (5) We also show that ΠTQ0 ⊂ P unless P = NC, i.e., the set ΠTQ0 of all Π-tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, ΠTQ = P, i.e., all PTIME query classes can be made Π-tractable via proper re-factorizations. This work is a step towards understanding the tractability of queries in the context of big data.
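The preprocess-then-query pattern described in the abstract can be illustrated with a small sketch. It is not taken from the paper: the function names `preprocess` and `answer` are hypothetical, and membership testing on a sorted list is chosen only as a simple instance. Sorting is a one-time polynomial-time step, after which each query takes logarithmic time on a single processor, which falls within the poly-logarithmic parallel time bound.

```python
import bisect


def preprocess(data):
    """One-time offline step (polynomial time): sort the data."""
    return sorted(data)


def answer(prepared, key):
    """Membership query on the preprocessed data via binary search, O(log n)."""
    i = bisect.bisect_left(prepared, key)
    return i < len(prepared) and prepared[i] == key


if __name__ == "__main__":
    prepared = preprocess([42, 7, 19, 3, 88])  # done once, offline
    print(answer(prepared, 19))  # many cheap queries afterwards
    print(answer(prepared, 20))
```

Without the preprocessing step, each membership query would need a linear scan of the data, which is the kind of cost that becomes prohibitive as the data grows.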
