Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms

We present a technical survey of state-of-the-art approaches to data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching, and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower-bounding techniques.
