New Algorithmic Tools for Distributed Similarity Search and Edge Estimation
