sPCA: Scalable Principal Component Analysis for Big Data on Distributed Platforms

Websites, social networks, sensors, and scientific experiments currently generate massive amounts of data. Owners of this data strive to obtain insights from it, often by applying machine learning algorithms. Many machine learning algorithms, however, do not scale well enough to cope with the ever-increasing volumes of data. To address this problem, we identify several optimizations that are crucial for scaling various machine learning algorithms in distributed settings. We apply these optimizations to the popular Principal Component Analysis (PCA) algorithm. PCA is an important tool in many areas, including image processing, data visualization, information retrieval, and dimensionality reduction. We refer to the proposed optimized PCA algorithm as scalable PCA, or sPCA. sPCA achieves scalability by employing efficient large matrix operations, effectively leveraging matrix sparsity, and minimizing intermediate data. We implement sPCA on the widely used MapReduce platform and on the memory-based Spark platform. We compare sPCA against the closest PCA implementations, which are the ones in Mahout/MapReduce and MLlib/Spark. Our experiments show that sPCA outperforms both Mahout-PCA and MLlib-PCA by wide margins in terms of accuracy, running time, and volume of intermediate data generated during the computation.
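As a rough illustration of what leveraging matrix sparsity can mean for PCA, the sketch below multiplies the mean-centered Gram matrix by a vector without ever building the dense centered matrix. It relies on the identity (X - 1mᵀ)ᵀ(X - 1mᵀ)v = XᵀXv - n·m·(m·v), where m is the vector of column means and n the number of rows. The abstract does not describe the actual implementation, so this is not presented as sPCA's method; the names `SparseRow`, `column_means`, and `centered_gram_times` are hypothetical, and rows are assumed to be stored as dictionaries of nonzero entries.

```python
from typing import Dict, List

# Assumption: each row is stored as {column index: nonzero value}.
SparseRow = Dict[int, float]


def column_means(rows: List[SparseRow], dim: int) -> List[float]:
    """Column means of the sparse matrix, touching only nonzero entries."""
    sums = [0.0] * dim
    for row in rows:
        for j, x in row.items():
            sums[j] += x
    n = len(rows)
    return [s / n for s in sums]


def centered_gram_times(
    rows: List[SparseRow], means: List[float], v: List[float]
) -> List[float]:
    """Return (X - 1 m^T)^T (X - 1 m^T) v without densifying X.

    Uses X^T X v - n * m * (m . v), so the mean is applied as a
    correction term instead of being subtracted from every entry.
    """
    n = len(rows)
    m_dot_v = sum(m * x for m, x in zip(means, v))
    out = [0.0] * len(v)
    for row in rows:
        # (X v)_i, computed from the nonzeros of row i only.
        xv = sum(x * v[j] for j, x in row.items())
        # Accumulate row i's contribution to X^T (X v).
        for j, x in row.items():
            out[j] += x * xv
    return [o - n * m * m_dot_v for o, m in zip(out, means)]
```

Subtracting the mean explicitly would turn every zero entry into a nonzero one, so keeping the correction as a separate term is one way the cost can stay proportional to the number of nonzeros. In a distributed setting, the per-row loop is the part that would be spread across workers.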
