Core‐sets: An updated survey

In optimization or machine learning problems we are given a set of items, usually points in some metric space, and the goal is to minimize or maximize an objective function over some space of candidate solutions. For example, in clustering problems, the input is a set of points in some metric space, and a common goal is to compute a set of centers in some other space (points, lines) that will minimize the sum of distances to these points. In database queries, we may need to compute such a sum for a specific query set of k centers. However, traditional algorithms cannot handle modern systems that require parallel real‐time computations of infinite distributed streams from sensors such as GPS, audio or video that arrive at a cloud, or networks of weaker devices such as smartphones or robots. A core‐set is a “small data” summarization of the input “big data,” where every possible query has approximately the same answer on both data sets. Generic techniques enable efficient coreset maintenance of streaming, distributed and dynamic data. Traditional algorithms can then be applied on these coresets to maintain the approximated optimal solutions. The challenge is to design coresets with a provable tradeoff between their size and approximation error. This survey summarizes such constructions in a retrospective way that aims to unify and simplify the state‐of‐the‐art.
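To make the idea of "approximately the same answer on both data sets" concrete, the following minimal sketch compares a k-center query on a full point set against the same query on a small weighted sample. It is an illustration only: uniform sampling is used as the simplest stand-in and does not carry the provable size/error guarantees the abstract refers to, squared Euclidean distance is an assumed choice of cost, and the names `cost`, `uniform_coreset`, the synthetic data and the sample size 300 are all hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, weights, centers):
    """Weighted sum of squared distances from each point to its nearest center."""
    return sum(
        w * min(sq_dist(p, c) for c in centers)
        for p, w in zip(points, weights)
    )


def uniform_coreset(points, m, rng):
    """Hypothetical helper: sample m points uniformly with replacement.

    Each sampled point gets weight n/m, so the weighted cost of the sample
    is an unbiased estimate of the cost of the full set for any fixed query.
    This is NOT one of the survey's constructions, only a simple stand-in.
    """
    n = len(points)
    sample = [points[rng.randrange(n)] for _ in range(m)]
    return sample, [n / m] * m


rng = random.Random(0)

# Synthetic "big data": three Gaussian clusters in the plane (assumed input).
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in true_centers
    for _ in range(2000)
]
full_weights = [1.0] * len(points)

# "Small data": 300 weighted points standing in for the 6000 originals.
core, core_w = uniform_coreset(points, 300, rng)

# A query is a set of k centers; evaluate it on both data sets.
query = true_centers
full_cost = cost(points, full_weights, query)
core_cost = cost(core, core_w, query)
rel_err = abs(full_cost - core_cost) / full_cost
print(f"full={full_cost:.1f} coreset={core_cost:.1f} relative error={rel_err:.3f}")
```

A single query on well-separated clusters is an easy case. The requirement in the abstract is stronger: the approximation should hold for every possible query, which is why constructions with provable guarantees are needed in place of plain uniform sampling.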
