Sublinear-Time Algorithms for Counting Star Subgraphs via Edge Sampling

We study the problem of estimating the value of sums of the form $$S_p \triangleq \sum_i \binom{x_i}{p}$$ when one has the ability to sample $$x_i \ge 0$$ with probability proportional to its magnitude. When $$p=2$$, this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when $$\{x_i\}$$ is the degree sequence of a graph, which corresponds to counting the number of $$p$$-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a $$(1 \pm \varepsilon)$$-multiplicative approximation of $$S_p$$ has query and time complexities $$O\left( \frac{m \log \log n}{\varepsilon^2 S_p^{1/p}}\right)$$. Here, $$m=\sum_i x_i/2$$ is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, $$n$$ is the number of vertices in the graph, or equivalently, the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when $$\{x_i\}$$ is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds (Gonen et al. in SIAM J Comput 25:1365–1411, 2011). With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where $$S_p\le n$$ and $$p=2$$, our upper bound is $$\tilde{O}(n/S_p^{1/2})$$, in contrast to their $$\Omega(n/S_p^{1/3})$$ lower bound when no random edge queries are available.
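To make the sampling model concrete, the following is a minimal Python sketch of a naive unbiased estimator for $$S_p$$ in the graph setting. It is not the algorithm of the paper and does not achieve the stated complexity; it only illustrates why sampling a vertex with probability proportional to its degree (by picking a uniform edge and then a uniform endpoint) gives access to $$S_p$$. The function names, the edge-list representation, and the assumption that degree lookups are available are all hypothetical choices made for illustration.

```python
import random
from math import comb


def exact_star_count(degrees, p):
    """Exact S_p = sum over vertices of C(deg(v), p), for reference."""
    return sum(comb(d, p) for d in degrees)


def estimate_star_count(edges, degree, p, num_samples, rng=None):
    """Naive unbiased estimate of S_p from uniform edge samples.

    Assumptions (not from the paper): `edges` is a list of undirected
    edges (u, v), and `degree` maps each vertex to its degree.
    """
    rng = rng or random.Random()
    m = len(edges)
    total = 0.0
    for _ in range(num_samples):
        u, v = rng.choice(edges)              # uniform random edge
        w = u if rng.random() < 0.5 else v    # w is drawn w.p. deg(w) / (2m)
        d = degree[w]                         # d >= 1 since w touches an edge
        # Reweight by 2m / d so that the expectation equals S_p.
        total += 2 * m * comb(d, p) / d
    return total / num_samples
```

Each sample has expectation $$\sum_w \frac{d_w}{2m} \cdot \frac{2m}{d_w}\binom{d_w}{p} = S_p$$, so the average is unbiased; its variance can be large on skewed degree sequences, which is one reason a more careful algorithm is needed to reach the bound above.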
In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between in-degree and out-degree is bounded—or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded—we give a sublinear time algorithm via a reduction to the undirected case.
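The correspondence between directed length-two paths and join sizes can be illustrated with a short exact computation. This is a hedged sketch for intuition only, not a sublinear-time method; the function names and the arc-list representation are assumptions introduced here. Note that the count below treats every pair of consecutive arcs $$u \to v \to w$$ as a path, including the case $$u = w$$.

```python
from collections import Counter


def join_size(left_column, right_column):
    """Number of row pairs that agree on the join attribute."""
    left = Counter(left_column)
    right = Counter(right_column)
    # A value occurring a times on the left and b times on the right
    # contributes a * b rows to the join result.
    return sum(count * right[value] for value, count in left.items())


def count_directed_two_paths(arcs):
    """Number of arc pairs (u, v), (v, w): sum of in(v) * out(v)."""
    in_deg = Counter(head for _, head in arcs)
    out_deg = Counter(tail for tail, _ in arcs)
    return sum(in_deg[v] * out_deg[v] for v in in_deg)
```

Joining the column of arc heads with the column of arc tails gives the same number as `count_directed_two_paths`, which is the equivalence the paragraph above refers to.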
