Sublinear-Time Algorithms for Counting Star Subgraphs with Applications to Join Selectivity Estimation

We study the problem of estimating sums of the form $S_p \triangleq \sum_i \binom{x_i}{p}$ when one can sample each $x_i \geq 0$ with probability proportional to its magnitude. When $p=2$, this problem is equivalent to estimating the selectivity of a self-join query in a database system when one can sample rows at random. We also study the special case in which $\{x_i\}$ is the degree sequence of a graph, which corresponds to counting the $p$-stars in a graph when one can sample edges at random. Our algorithm computes a $(1 \pm \varepsilon)$-multiplicative approximation of $S_p$ with query and time complexities $O(\frac{m \log \log n}{\varepsilon^2 S_p^{1/p}})$. Here, $m=\sum_i x_i/2$ is the number of edges in the graph or, equivalently, half the number of records in the database table. Similarly, $n$ is the number of vertices in the graph or, equivalently, the number of distinct values in the database table. We also provide lower bounds that are tight up to polylogarithmic factors in almost all cases, even when $\{x_i\}$ is a degree sequence and one is allowed to use the structure of the graph to try to obtain a better estimate. We are not aware of any prior lower bounds for join selectivity estimation. For the graph problem, prior work that assumed the ability to sample only \emph{vertices} uniformly gave algorithms with matching lower bounds [Gonen, Ron, and Shavitt. \textit{SIAM J. Comput.}, 25 (2011), pp. 1365--1411]. We show that, with the ability to sample edges at random, one can approximate the number of star subgraphs faster, bypassing the lower bounds of this prior work. For example, in the regime where $S_p \leq n$ and $p=2$, our upper bound is $\tilde{O}(n/S_p^{1/2})$, in contrast to their $\Omega(n/S_p^{1/3})$ lower bound when no random edge queries are available.
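To make the quantity $S_p$ and the sampling model concrete, the following Python sketch computes $S_p$ exactly from a small sequence and compares it with a naive estimator that draws indices with probability proportional to $x_i$. This is only an illustration under stated assumptions, not the algorithm analyzed in the paper: the function names are hypothetical, the total $\sum_i x_i$ is assumed to be known, and the naive estimator does not achieve the complexity bound stated above. In the graph setting, drawing a uniformly random edge and then one of its two endpoints uniformly yields a vertex with probability proportional to its degree.

```python
import random
from math import comb


def exact_star_sum(x, p):
    """Return S_p = sum_i C(x_i, p), computed exactly by reading every x_i."""
    return sum(comb(xi, p) for xi in x)


def naive_star_estimate(x, p, num_samples, rng):
    """Naive unbiased estimate of S_p from magnitude-proportional samples.

    Assumption (not from the paper): the total sum(x) is known in advance.
    Index i is drawn with probability x_i / sum(x), so the expectation of
    C(x_i, p) / x_i equals S_p / sum(x); rescaling by sum(x) gives S_p.
    """
    total = sum(x)
    # Entries with x_i == 0 have weight 0 and are never drawn,
    # so the division below is always by a positive number.
    draws = rng.choices(range(len(x)), weights=x, k=num_samples)
    mean = sum(comb(x[i], p) / x[i] for i in draws) / num_samples
    return total * mean


# Degree sequence of a star with one center and three leaves.
degrees = [3, 1, 1, 1]
rng = random.Random(0)

s2_exact = exact_star_sum(degrees, 2)
s2_estimate = naive_star_estimate(degrees, 2, 20000, rng)

# For p = 2, the self-join size sum_i x_i^2 equals 2 * S_2 + sum_i x_i.
self_join_size = sum(d * d for d in degrees)
```

Here `s2_exact` counts the 2-stars centered at the degree-3 vertex, and `self_join_size` shows how $S_2$ determines the self-join size once $\sum_i x_i$ is known. The naive estimator is unbiased, but its variance can be large when a few indices carry most of $S_p$, which is one reason a more careful algorithm is needed.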
