Quality and efficiency for kernel density estimates in large data

Kernel density estimates are important for a broad variety of applications. Their construction has been well studied, but existing techniques are expensive on massive datasets, provide only heuristic approximations without theoretical guarantees, or both. We propose randomized and deterministic algorithms that come with quality guarantees and are orders of magnitude more efficient than previous algorithms. Our algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scalability of our techniques.
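
As a rough illustration of the general idea of approximating a kernel density estimate from a small subset of the data, the sketch below compares a one-dimensional Gaussian estimate built on the full dataset with one built on a uniform random sample. It is not the algorithm proposed here: the function names, the Gaussian kernel, the bandwidth, and the synthetic data are all assumptions made for the example. The only property it is meant to show is that the sample is chosen without looking at the kernel or its bandwidth.

```python
import math
import random


def gaussian_kde(points, x, bandwidth):
    """Average of Gaussian kernels centred at `points`, evaluated at x (1-D)."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
    return norm * total / len(points)


def sample_kde_error(data, sample_size, queries, bandwidth, seed=0):
    """Largest absolute gap, over `queries`, between the full-data estimate
    and the estimate built on a uniform random sample (hypothetical helper)."""
    rng = random.Random(seed)
    # The sample is drawn before any kernel or bandwidth is fixed.
    sample = rng.sample(data, sample_size)
    return max(
        abs(gaussian_kde(data, q, bandwidth) - gaussian_kde(sample, q, bandwidth))
        for q in queries
    )


# Synthetic data, assumed for illustration only.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]
queries = [-2.0 + 0.5 * i for i in range(9)]

err = sample_kde_error(data, sample_size=2000, queries=queries, bandwidth=0.3)
print(f"max abs error with a 10% sample: {err:.4f}")
```

Rerunning with a different bandwidth reuses the same sampling step unchanged, which is the sense in which a kernel-oblivious summary can serve many later queries.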
