Paper information - Practical Coreset Constructions for Machine Learning

We investigate coresets, succinct summaries of large data sets constructed so that solutions found on the summary are provably competitive with solutions found on the full data set. We provide an overview of the state of the art in coreset construction for machine learning. In Section 2, we present both the intuition behind and a theoretically sound framework for constructing coresets for general problems, and we apply it to $k$-means clustering. In Section 3, we summarize existing coreset construction algorithms for a variety of machine learning problems, such as maximum likelihood estimation of mixture models, Bayesian non-parametric models, principal component analysis, regression, and general empirical risk minimization.
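To make the idea of a weighted summary concrete, the following is a minimal sketch of an importance-sampling coreset for the $k$-means cost. It is an illustration under stated assumptions, not necessarily the construction the paper presents: the sampling distribution (half uniform, half proportional to squared distance from the data mean) is one simple choice, and the helper names `sq_dist`, `kmeans_cost`, and `sample_coreset` are hypothetical. Each sampled point receives weight 1 / (m * q), which makes the weighted cost an unbiased estimate of the full cost for any fixed set of centers.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, weights, centers):
    """Weighted k-means cost: sum of w * squared distance to the nearest center."""
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def sample_coreset(points, m, seed=0):
    """Sample m weighted points by importance sampling (illustrative helper).

    Assumption: the sampling probability q mixes a uniform term with the
    squared distance to the data mean. Weights 1 / (m * q) keep the weighted
    cost an unbiased estimate of the full cost.
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    d = [sq_dist(p, mean) for p in points]
    total = sum(d)
    # Fall back to uniform sampling when all points coincide (total == 0).
    q = [0.5 / n + (0.5 * di / total if total > 0 else 0.5 / n) for di in d]
    idx = rng.choices(range(n), weights=q, k=m)
    return [points[i] for i in idx], [1.0 / (m * q[i]) for i in idx]


# Synthetic data: two Gaussian blobs of 500 points each.
data_rng = random.Random(1)
data = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
        for cx, cy in [(0.0, 0.0), (8.0, 8.0)] for _ in range(500)]
centers = [(0.0, 0.0), (8.0, 8.0)]

full_cost = kmeans_cost(data, [1.0] * len(data), centers)
coreset, coreset_weights = sample_coreset(data, m=100)
approx_cost = kmeans_cost(coreset, coreset_weights, centers)
print("full cost:", full_cost, "coreset estimate:", approx_cost)
```

Evaluating candidate centers on the 100 weighted points instead of the 1000 originals is the practical benefit. The provable guarantee described in the abstract requires the sample size and sampling distribution to be chosen as the theory prescribes, which this sketch does not attempt.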
