Greedy column subset selection for large-scale data sets

In today’s information systems, the availability of massive amounts of data necessitates the development of fast and accurate algorithms to summarize these data and represent them in a succinct format. One crucial problem in big data analytics is the selection of representative instances from large and massively distributed data, which is formally known as the Column Subset Selection problem. Solving this problem helps data analysts gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. This paper presents a fast and accurate greedy algorithm for large-scale column subset selection. The algorithm minimizes an objective function that measures the reconstruction error of the data matrix based on the subset of selected columns. The paper first presents a centralized greedy algorithm for column subset selection, which depends on a novel recursive formula for calculating the reconstruction error of the data matrix. The paper then presents a MapReduce algorithm, which selects a few representative columns from a matrix whose columns are massively distributed across several commodity machines. The algorithm first learns a concise representation of all columns using random projection. It then solves a generalized column subset selection problem at each machine, in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
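To make the greedy criterion concrete, the following is a minimal sketch of centralized greedy column selection under the Frobenius-norm reconstruction error. It is not the paper's recursive formula, which the abstract does not spell out. It assumes a dense matrix small enough to hold in memory and simply deflates an explicit residual matrix after each pick. The names `greedy_css` and `reconstruction_error` are hypothetical and chosen only for illustration.

```python
import numpy as np


def greedy_css(A, k, tol=1e-12):
    """Greedily pick up to k column indices of A (illustrative sketch).

    At each step, the column whose addition gives the largest drop in
    ||A - P_S A||_F^2 is selected, where P_S projects onto the span of
    the selected columns. The residual E = A - P_S A is kept explicitly.
    """
    E = np.array(A, dtype=float)          # residual, initially A itself
    scale = max(np.sum(E * E), 1.0)       # used for a relative tolerance
    selected = []
    for _ in range(k):
        G = E.T @ E                       # Gram matrix of the residual
        norms = np.diag(G).copy()         # squared norms of residual columns
        valid = norms > tol * scale       # skip columns already explained
        if not valid.any():
            break                         # residual is numerically zero
        scores = np.zeros_like(norms)
        # Error reduction from adding column i: ||E^T E_i||^2 / ||E_i||^2
        scores[valid] = np.sum(G[:, valid] ** 2, axis=0) / norms[valid]
        j = int(np.argmax(scores))
        selected.append(j)
        e = E[:, j] / np.sqrt(norms[j])   # unit vector along residual column
        E = E - np.outer(e, e @ E)        # remove that direction from E
    return selected


def reconstruction_error(A, cols):
    """Squared Frobenius error of reconstructing A from columns `cols`."""
    S = A[:, cols]
    coeffs, *_ = np.linalg.lstsq(S, A, rcond=None)
    return float(np.linalg.norm(A - S @ coeffs, "fro") ** 2)
```

For example, on a matrix of exact rank 3, `greedy_css(A, 3)` should return three columns whose span reproduces `A` up to rounding error. Recomputing the Gram matrix at every step is the costly part of this sketch; avoiding that kind of recomputation is presumably what a recursive update of the error is meant to achieve.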
