k-Means Clustering of Lines for Big Data

The input to the $k$-means for lines problem is a set $L$ of $n$ lines in $\mathbb{R}^d$, and the goal is to compute a set of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum of squared distances from every line in $L$ to its nearest center. This is a straightforward generalization of the $k$-means problem, where the input is a set of $n$ points instead of lines. We present the first PTAS that computes a $(1+\epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$ and constant integers $k, d \geq 1$. We achieve this by proving that there is always a weighted subset (called a coreset) of $dk^{O(k)}\log (n)/\epsilon^2$ lines in $L$ that approximates the sum of squared distances from $L$ to any given set of $k$ points. Using the traditional merge-and-reduce technique, this coreset implies results for a (possibly infinite) stream of lines distributed among $M$ machines (e.g., in the cloud) in one pass, using memory, update time, and communication that are near-logarithmic in $n$. It also supports deletion of any line, at the cost of linear space. These results generalize to other distance functions, such as $k$-median (sum of distances), and to ignoring the farthest $m$ lines from the given centers in order to handle outliers. Experimental results on 10 machines on the Amazon EC2 cloud show that the algorithm performs well in practice. Open source code for all the algorithms and experiments is also provided. This thesis is an extension of the following accepted paper: "$k$-Means Clustering of Lines for Big Data", by Yair Marom & Dan Feldman, Proceedings of the NeurIPS 2019 conference, to appear in December 2019.
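To make the objective concrete, the following is a minimal sketch of the cost function only, that is, the (optionally weighted) sum of squared distances from each line to its nearest center. It is not the coreset construction or the PTAS from the paper. The representation of a line as a (point, direction) pair, the function names, and the `weights` parameter are assumptions made for illustration; the weights stand in for the weights a coreset would attach to its lines.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
# Assumed representation: a line is (a point on the line, a non-zero direction).
Line = Tuple[Vector, Vector]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_distance_to_line(center: Vector, line: Line) -> float:
    """Squared Euclidean distance from a point to an infinite line."""
    point, direction = line
    diff = [c - p for c, p in zip(center, point)]
    norm_sq = _dot(direction, direction)
    if norm_sq == 0.0:
        raise ValueError("direction vector must be non-zero")
    # Pythagoras: subtract the squared component of diff along the line.
    along_sq = _dot(diff, direction) ** 2 / norm_sq
    # Clamp tiny negative values caused by floating-point rounding.
    return max(_dot(diff, diff) - along_sq, 0.0)


def line_kmeans_cost(
    lines: List[Line],
    centers: List[Vector],
    weights: Optional[List[float]] = None,
) -> float:
    """Weighted sum over lines of the squared distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(lines)
    return sum(
        w * min(squared_distance_to_line(c, line) for c in centers)
        for w, line in zip(weights, lines)
    )


# Three lines in the plane: the x-axis, the line y = 2, and the line x = 5.
lines = [((0.0, 0.0), (1.0, 0.0)),
         ((0.0, 2.0), (1.0, 0.0)),
         ((5.0, 0.0), (0.0, 1.0))]
centers = [(0.0, 1.0), (5.0, 5.0)]

print(line_kmeans_cost(lines, centers))  # 1 + 1 + 0 = 2.0
```

In this example the first two lines are each at distance 1 from the center (0, 1), and the third line passes through the center (5, 5), so the total cost is 2. A coreset in the sense of the abstract would be a small weighted subset of `lines` whose weighted cost is within a $(1 \pm \epsilon)$ factor of this value for every choice of $k$ centers.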
