Data reduction for weighted and outlier-resistant clustering

Statistical data frequently includes outliers; these can distort the results of estimation procedures and optimization problems. For this reason, loss functions that de-emphasize the effect of outliers are widely used by statisticians. However, there are relatively few algorithmic results about clustering with outliers. For instance, the k-median with outliers problem uses a loss function $\rho_{c_1,\dots,c_k}(x) = \min\{h, \min_i \operatorname{dist}(x, c_i)\}$, that is, the minimum of a penalty $h$ and the least distance between the data point $x$ and a center $c_i$. The loss-minimizing choice of $\{c_1, \dots, c_k\}$ is an outlier-resistant clustering of the data. This problem is also a natural special case of the k-median with penalties problem considered by [Charikar, Khuller, Mount and Narasimhan SODA'01]. The essential challenge that arises in these optimization problems is data reduction for the weighted k-median problem. We solve this problem, which was previously solved only in one dimension ([Har-Peled FSTTCS'06], [Feldman, Fiat and Sharir FOCS'06]). As a corollary, we also achieve improved data reduction for the k-line-median problem.
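To make the capped loss concrete, the following is a minimal Python sketch of the objective described above. It is an illustration, not the paper's algorithm: the function names `robust_loss` and `total_cost`, the use of Euclidean distance, the optional per-point weights, and the sample points are all assumptions introduced here.

```python
import math
from typing import Optional, Sequence

Point = Sequence[float]


def robust_loss(x: Point, centers: Sequence[Point], h: float) -> float:
    """Capped loss: min(h, distance from x to its nearest center).

    Euclidean distance is an assumption made for this sketch.
    """
    nearest = min(math.dist(x, c) for c in centers)
    return min(h, nearest)


def total_cost(
    points: Sequence[Point],
    centers: Sequence[Point],
    h: float,
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum of (optionally weighted) capped losses over all points."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted case
    return sum(w * robust_loss(x, centers, h) for x, w in zip(points, weights))


# Hypothetical data: two tight groups plus one far-away outlier.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (100.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

capped = total_cost(points, centers, h=5.0)         # outlier contributes only h
uncapped = total_cost(points, centers, h=math.inf)  # plain k-median cost
```

With the penalty `h = 5.0`, the far point contributes 5 instead of 90 to the cost, so it has little pull on where the centers should go; setting `h` to infinity recovers the ordinary k-median cost.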

[1]  Samir Khuller,et al.  Algorithms for facility location problems with outliers , 2001, SODA '01.

[2]  David P. Williamson,et al.  A general approximation technique for constrained forest problems , 1992, SODA '92.

[3]  F. Mosteller,et al.  Understanding robust and exploratory data analysis , 1985 .

[4]  Peter J. Rousseeuw,et al.  Robust Regression and Outlier Detection , 2005, Wiley Series in Probability and Statistics.

[5]  Ke Chen,et al.  A constant factor approximation algorithm for k-median clustering with outliers , 2008, SODA '08.

[6]  Sariel Har-Peled,et al.  No, Coreset, No Cry , 2004, FSTTCS.

[7]  David Haussler,et al.  Epsilon-nets and simplex range queries , 1986, SCG '86.

[8]  Shai Ben-David,et al.  Characterizations of learnability for classes of {0, …, n}-valued functions , 1992, COLT '92.

[9]  Jon M. Kleinberg,et al.  Segmentation problems , 2004, JACM.

[10]  Johanna S. Hardin,et al.  A robust measure of correlation between two genes on a microarray , 2007, BMC Bioinformatics.

[11]  S. Shelah A combinatorial problem; stability and order for models and theories in infinitary languages. , 1972 .

[12]  Martin J. Wainwright,et al.  A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers , 2009, NIPS.

[13]  B. Natarajan On learning sets and functions , 2004, Machine Learning.

[14]  Dan Feldman Coresets for Weighted Facilities and Their Applications , 2006 .

[15]  Jirí Matousek,et al.  Approximations and optimal geometric divide-and-conquer , 1991, STOC '91.

[16]  Dan Feldman,et al.  A PTAS for k-means clustering based on weak coresets , 2007, SCG '07.

[17]  Jeffrey Scott Vitter,et al.  ε-approximations with minimum packing constraint violation (extended abstract) , 1992, STOC '92.

[18]  Noga Alon,et al.  Scale-sensitive dimensions, uniform convergence, and learnability , 1997, JACM.

[19]  Philip M. Long,et al.  Fat-shattering and the learnability of real-valued functions , 1994, COLT '94.

[20]  Meena Mahajan,et al.  The Planar k-means Problem is NP-hard , 2009 .

[21]  David P. Williamson,et al.  A note on the prize collecting traveling salesman problem , 1993, Math. Program..

[22]  David R. Karger,et al.  Approximating s-t Minimum Cuts in Õ(n²) Time , 2007 .

[23]  Robert E. Schapire,et al.  Efficient Distribution-Free Learning of Probabilistic , 1994 .

[24]  Amos Fiat,et al.  Coresets for Weighted Facilities and Their Applications , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[25]  Claire Mathieu,et al.  A Randomized Approximation Scheme for Metric MAX-CUT , 1998, FOCS.

[26]  Kasturi R. Varadarajan,et al.  Sampling-based dimension reduction for subspace approximation , 2007, STOC '07.

[27]  Kamesh Munagala,et al.  Local Search Heuristics for k-Median and Facility Location Problems , 2004, SIAM J. Comput..

[28]  Rui Wang,et al.  A GPU-Based Approximate SVD Algorithm , 2011, PPAM.

[29]  Philip M. Long,et al.  Prediction, Learning, Uniform Convergence, and Scale-Sensitive Dimensions , 1998, J. Comput. Syst. Sci..

[30]  Philip M. Long,et al.  Characterizations of Learnability for Classes of {0, ..., n}-Valued Functions , 1995, J. Comput. Syst. Sci..

[31]  Leonard J. Schulman,et al.  Clustering for Edge-Cost Minimization , 1999, Electron. Colloquium Comput. Complex..

[32]  Pankaj K. Agarwal,et al.  An Efficient Algorithm for 2D Euclidean 2-Center with Outliers , 2008, ESA.

[33]  Noga Alon,et al.  On Two Segmentation Problems , 1999, J. Algorithms.

[34]  B. Ripley,et al.  Robust Statistics , 2018, Encyclopedia of Mathematical Geosciences.

[35]  Tomás Feder,et al.  Optimal algorithms for approximate clustering , 1988, STOC '88.

[36]  Sariel Har-Peled,et al.  Coresets for Discrete Integration and Clustering , 2006, FSTTCS.

[37]  Norbert Sauer,et al.  On the Density of Families of Sets , 1972, J. Comb. Theory A.

[38]  Jirí Matousek,et al.  On Approximate Geometric k-Clustering , 2000, Discret. Comput. Geom..

[39]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[40]  Alan M. Frieze,et al.  Clustering Large Graphs via the Singular Value Decomposition , 2004, Machine Learning.

[41]  David Haussler,et al.  ɛ-nets and simplex range queries , 1987, Discret. Comput. Geom..

[42]  Pankaj K. Agarwal,et al.  Approximation Algorithms for k-Line Center , 2002, ESA.

[44]  D. Pollard Convergence of stochastic processes , 1984 .

[45]  L. Schulman,et al.  Universal ε-approximators for integrals , 2010, SODA '10.

[46]  Michael Langberg,et al.  Universal epsilon-approximators for Integrals , 2010, ACM-SIAM Symposium on Discrete Algorithms.

[47]  Elvezio Ronchetti,et al.  A smoothing principle for the Huber and other location M-estimators , 2011, Comput. Stat. Data Anal..

[48]  David Haussler,et al.  Learnability and the Vapnik-Chervonenkis dimension , 1989, JACM.

[49]  Leonard J. Schulman,et al.  Clustering for edge-cost minimization (extended abstract) , 2000, STOC '00.

[50]  Vladimir Vapnik,et al.  On the uniform convergence of relative frequencies of events to their probabilities , 1971 .

[51]  Vladimir Vapnik,et al.  Inductive principles of the search for empirical dependences (methods based on weak convergence of probability measures) , 1989, COLT '89.

[52]  Michael Langberg,et al.  A unified framework for approximating and clustering data , 2011, STOC.

[53]  Michelle Effros,et al.  Deterministic clustering with data nets , 2004, Electron. Colloquium Comput. Complex..