Finding Correlations in Subquadratic Time, with Applications to Learning Parities and the Closest Pair Problem

Given a set of n d-dimensional Boolean vectors with the promise that the vectors are chosen uniformly at random with the exception of two vectors that have Pearson correlation coefficient ρ (Hamming distance d(1-ρ)/2), how quickly can one find the two correlated vectors? We present an algorithm which, for any constants ϵ>0 and ρ>0, runs in expected time O(n^((5-ω)/(4-ω)+ϵ) + nd) < O(n^1.62 + nd), where ω < 2.4 is the exponent of matrix multiplication. This is the first subquadratic-time algorithm for this problem in which ρ does not appear in the exponent of n, and it improves upon the O(n^(2-O(ρ))) runtime of Paturi et al. [1989], the Locality Sensitive Hashing approach of Motwani [1998], and the Bucketing Codes approach of Dubiner [2008]. Applications and extensions of this basic algorithm yield significantly improved algorithms for several other problems.

Approximate Closest Pair. For any sufficiently small constant ϵ>0, given n d-dimensional vectors, there exists an algorithm that returns a pair of vectors whose Euclidean (or Hamming) distance differs from that of the closest pair by a factor of at most 1+ϵ, and runs in time O(n^(2-Θ(√ϵ))). The best previous algorithms (including Locality Sensitive Hashing) have runtime O(n^(2-O(ϵ))).

Learning Sparse Parities with Noise. Given samples from an instance of the learning parities with noise problem where each example has length n, the true parity set has size at most k ≪ n, and the noise rate is η, there exists an algorithm that identifies the set of k indices in time n^(((ω+ϵ)/3)k) poly(1/(1-2η)) < n^(0.8k) poly(1/(1-2η)). This is the first algorithm with no dependence on η in the exponent of n, aside from the trivial O(binom(n,k)) ≈ O(n^k) brute-force algorithm, and for large noise rates (η > 0.4) it improves upon the results of Grigorescu et al. [2011], which give a runtime of n^((1+(2η)^2+o(1))k/2) poly(1/(1-2η)).

Learning k-Juntas with Noise. Given uniformly random length-n Boolean vectors, together with a label, which is some function of just k ≪ n of the bits, perturbed by noise rate η, return the set of relevant indices. Leveraging the reduction of Feldman et al. [2009], our result for learning k-parities implies an algorithm for this problem with runtime n^(((ω+ϵ)/3)k) poly(1/(1-2η)) < n^(0.8k) poly(1/(1-2η)), which is the first runtime for this problem of the form n^(ck) with an absolute constant c < 1.

Learning k-Juntas without Noise. Given uniformly random length-n Boolean vectors, together with a label, which is some function of k ≪ n of the bits, return the set of relevant indices. Using a modification of the algorithm of Mossel et al. [2004], and employing our algorithm for learning sparse parities with noise via the reduction of Feldman et al. [2009], we obtain an algorithm for this problem with runtime n^(((ω+ϵ)/4)k) poly(n) < n^(0.6k) poly(n), which improves on the previous best of n^((ω/(ω+1))k) ≈ n^(0.7k) poly(n) of Mossel et al. [2004].
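The headline problem above (often called the light bulb problem) can be made concrete with a short sketch. The code below is not the paper's subquadratic algorithm; it is a hypothetical illustration, with invented helper names, of the instance distribution, the correlation/Hamming-distance relationship d(1-ρ)/2, and the naive O(n²d) brute-force baseline that the result improves upon.

```python
import random

def plant_instance(n, d, rho, seed=0):
    """n uniform d-bit vectors, except one planted pair whose coordinates agree
    with probability (1+rho)/2, i.e. expected Hamming distance d(1-rho)/2."""
    rng = random.Random(seed)
    vecs = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
    i, j = 0, 1  # planted pair; indices chosen arbitrarily for the sketch
    vecs[j] = [b if rng.random() < (1 + rho) / 2 else 1 - b for b in vecs[i]]
    return vecs, (i, j)

def closest_pair_bruteforce(vecs):
    """Naive quadratic scan: return the pair with minimum Hamming distance."""
    best, best_pair = None, None
    for a in range(len(vecs)):
        for b in range(a + 1, len(vecs)):
            dist = sum(x != y for x, y in zip(vecs[a], vecs[b]))
            if best is None or dist < best:
                best, best_pair = dist, (a, b)
    return best_pair, best

vecs, planted = plant_instance(n=200, d=400, rho=0.5)
pair, dist = closest_pair_bruteforce(vecs)
```

Because the planted pair's expected distance d(1-ρ)/2 = 100 sits far below the ~d/2 = 200 around which independent uniform pairs concentrate, even this quadratic scan recovers the pair; the paper's contribution is doing so in time O(n^1.62 + nd).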

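The Learning Sparse Parities with Noise setup described above can likewise be sketched. This is a hypothetical illustration (all function names invented) of the sample distribution and of the trivial O(n^k) brute force over all size-k candidate sets mentioned in the abstract, not the paper's n^(0.8k) algorithm.

```python
import itertools
import random

def lpn_samples(n, parity, eta, m, seed=1):
    """m samples (x, label): x uniform in {0,1}^n, label = XOR of x over the
    indices in `parity`, flipped independently with probability eta."""
    rng = random.Random(seed)
    samples = []
    for _ in range(m):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = sum(x[i] for i in parity) % 2
        if rng.random() < eta:
            y ^= 1  # noise flip
        samples.append((x, y))
    return samples

def brute_force_parity(samples, n, k):
    """Try all C(n,k) candidate sets: the true parity agrees with roughly a
    (1-eta) fraction of labels, while every wrong set agrees with about half."""
    best_set, best_agree = None, -1
    for cand in itertools.combinations(range(n), k):
        agree = sum((sum(x[i] for i in cand) % 2) == y for x, y in samples)
        if agree > best_agree:
            best_set, best_agree = set(cand), agree
    return best_set

samples = lpn_samples(n=20, parity=[3, 7, 11], eta=0.1, m=400)
recovered = brute_force_parity(samples, n=20, k=3)
```

The number of samples needed to separate the true set from the noise floor grows like poly(1/(1-2η)), matching the poly factor in the stated runtimes; the hard part, which the paper addresses, is beating the C(n,k) enumeration in the exponent of n.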
[1] Ravi Kumar, et al. A sieve algorithm for the shortest lattice vector problem, 2001, STOC '01.

[2] O. Regev. The Learning with Errors problem, 2010.

[3] Gregory Valiant, et al. Finding Correlations in Subquadratic Time, with Applications to Learning Parities and Juntas, 2012, 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science.

[4] Vinod Vaikuntanathan, et al. Efficient Fully Homomorphic Encryption from (Standard) LWE, 2011, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[5] Leslie G. Valiant, et al. Functionality in neural nets, 1988, COLT '88.

[6] Oded Regev. The Learning with Errors Problem (Invited Survey), 2010, 2010 IEEE 25th Annual Conference on Computational Complexity.

[7] Qiang Yang, et al. Detecting two-locus associations allowing for interactions in genome-wide association studies, 2010, Bioinformatics.

[8] Jon Louis Bentley, et al. Multidimensional binary search trees used for associative searching, 1975, CACM.

[9] S. Meiser, et al. Point Location in Arrangements of Hyperplanes, 1993, Information and Computation.

[10] Yi Wu, et al. Optimal Lower Bounds for Locality-Sensitive Hashing (Except When q is Tiny), 2014, TOCT.

[11] Manuel Blum, et al. Secure Human Identification Protocols, 2001, ASIACRYPT.

[12] Moshe Dubiner, et al. Bucketing Coding and Information Theory for the Statistical High-Dimensional Nearest-Neighbor Problem, 2008, IEEE Transactions on Information Theory.

[13] Sanguthevar Rajasekaran, et al. The light bulb problem, 1995, COLT '89.

[14] Moses Charikar, et al. Similarity estimation techniques from rounding algorithms, 2002, STOC '02.

[15] Vitaly Feldman, et al. New Results for Learning Noisy Parities and Halfspaces, 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[16] T. J. Rivlin. The Chebyshev polynomials, 1974.

[17] Oded Regev, et al. On lattices, learning with errors, random linear codes, and cryptography, 2005, STOC '05.

[18] Chris Peikert, et al. Public-key cryptosystems from the worst-case shortest vector problem: extended abstract, 2009, STOC '09.

[19] Rafail Ostrovsky, et al. Efficient search for approximate nearest neighbor in high dimensional spaces, 1998, STOC '98.

[20] Ryan O'Donnell, et al. Learning functions of k relevant variables, 2004, Journal of Computer and System Sciences.

[21] Rajeev Motwani, et al. Lower bounds on locality sensitive hashing, 2005, SCG '06.

[22] Vitaly Feldman, et al. On Agnostic Learning of Parities, Monomials, and Halfspaces, 2009, SIAM Journal on Computing.

[23] Rina Panigrahy, et al. Entropy based nearest neighbor search in high dimensions, 2005, SODA '06.

[24] Rene F. Swarttouw, et al. Orthogonal polynomials, 2020, NIST Handbook of Mathematical Functions.

[25] Karsten A. Verbeurgt. Learning DNF under the uniform distribution in quasi-polynomial time, 1990, COLT '90.

[26] A. Ron, et al. Strictly positive definite functions on spheres in Euclidean spaces, 1994, Mathematics of Computation.

[27] Rasmus Pagh, et al. Compressed matrix multiplication, 2011, ITCS '12.

[28] Noga Alon, et al. Approximating the cut-norm via Grothendieck's inequality, 2004, STOC '04.

[29] P. Donnelly, et al. Genome-wide strategies for detecting multiple loci that influence complex diseases, 2005, Nature Genetics.

[30] Russell Impagliazzo, et al. How to recycle random bits, 1989, 30th Annual Symposium on Foundations of Computer Science.

[31] Nicole Immorlica, et al. Locality-sensitive hashing scheme based on p-stable distributions, 2004, SCG '04.

[32] Hans-Jörg Schek, et al. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, 1998, VLDB.

[33] Virginia Vassilevska Williams, et al. Multiplying matrices faster than Coppersmith-Winograd, 2012, STOC '12.

[34] Piotr Indyk, et al. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality, 2012, Theory of Computing.

[35] Alexandr Andoni, et al. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions, 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[36] Michael Kearns, et al. Efficient noise-tolerant learning from statistical queries, 1993, STOC.

[37] Hanan Samet, et al. Foundations of multidimensional and metric data structures, 2006, Morgan Kaufmann series in data management systems.

[38] Vadim Lyubashevsky, et al. The Parity Problem in the Presence of Noise, Decoding Random Linear Codes, and the Subset Sum Problem, 2005, APPROX-RANDOM.

[39] Kenneth L. Clarkson, et al. A Randomized Algorithm for Closest-Point Queries, 1988, SIAM Journal on Computing.

[40] Yishay Mansour, et al. Weakly learning DNF and characterizing statistical query learning using Fourier analysis, 1994, STOC '94.

[41] Santosh S. Vempala, et al. On Noise-Tolerant Learning of Sparse Parities and Related Problems, 2011, ALT.

[42] Sanjeev Arora, et al. New Algorithms for Learning in Presence of Errors, 2011, ICALP.

[43] Piotr Indyk, et al. Approximate nearest neighbors: towards removing the curse of dimensionality, 1998, STOC '98.

[44] Don Coppersmith, et al. Rectangular Matrix Multiplication Revisited, 1997, Journal of Complexity.