Incremental partial least squares analysis of big streaming data

Incremental feature extraction is effective for facilitating the analysis of large-scale streaming data. However, most current incremental feature extraction methods are not suitable for streaming data with high feature dimensions, because only a few of them have low time complexity, that is, complexity linear in both the number of samples and the number of features. In addition, feature extraction should improve the performance of subsequent classification. Incremental feature extraction methods therefore need to be both more efficient and more effective. Partial least squares (PLS) is known to be an effective dimension reduction technique for classification, but its application to streaming data is still an open problem. In this study, we propose a highly efficient and powerful dimension reduction algorithm called incremental PLS (IPLS), which comprises a two-stage extraction process. In the first stage, the PLS target function is made incremental by updating the historical mean, which yields the leading projection direction. In the second stage, the other projection directions are calculated based on the equivalence between the PLS vectors and the Krylov sequence. We compared IPLS with other state-of-the-art incremental feature extraction methods, such as incremental principal component analysis, incremental maximum margin criterion, and incremental inter-class scatter, on real streaming datasets. Our empirical results showed that IPLS outperformed the other methods in terms of both efficiency and subsequent classification accuracy.
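The two stages described above can be illustrated with a minimal sketch. It is not the authors' IPLS implementation: the class name `StreamingPLSSketch`, its methods, and the synthetic data are hypothetical, and a single numeric response (for example, a binary class label) is assumed. The first stage keeps running means and a running cross-product between features and response, whose normalized value is the leading PLS direction. The second stage orthogonalizes a Krylov sequence built from that direction. For simplicity the sketch stores the full feature scatter matrix, which costs quadratic memory in the number of features and therefore does not reach the linear complexity the abstract targets.

```python
import numpy as np


class StreamingPLSSketch:
    """Illustrative only (hypothetical names): running statistics for
    single-response PLS directions, updated one sample at a time."""

    def __init__(self, n_features):
        self.n = 0
        self.mean_x = np.zeros(n_features)  # historical feature mean
        self.mean_y = 0.0                   # historical response mean
        self.cross = np.zeros(n_features)   # sum of (x - mean_x) * (y - mean_y)
        # Assumption for simplicity: full d x d scatter matrix (not linear in d).
        self.scatter = np.zeros((n_features, n_features))

    def update(self, x, y):
        """Stage 1: fold one sample into the running means and co-moments."""
        self.n += 1
        dx = x - self.mean_x                # deviation from the old mean
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Welford-style updates: old-mean deviation times new-mean deviation.
        self.cross += dx * (y - self.mean_y)
        self.scatter += np.outer(dx, x - self.mean_x)

    def directions(self, k):
        """Stage 2: orthonormal basis of the Krylov sequence
        cross, S @ cross, S^2 @ cross, ... (S = scatter)."""
        basis = []
        v = self.cross.copy()
        for _ in range(k):
            for b in basis:                 # modified Gram-Schmidt step
                v = v - (b @ v) * b
            norm = np.linalg.norm(v)
            if norm < 1e-12:                # Krylov space exhausted
                break
            w = v / norm
            basis.append(w)
            v = self.scatter @ w            # next Krylov vector
        if not basis:
            return np.zeros((self.mean_x.size, 0))
        return np.column_stack(basis)


# Usage on synthetic data (assumed, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # synthetic binary labels

model = StreamingPLSSketch(n_features=8)
for x_i, y_i in zip(X, y):
    model.update(x_i, y_i)

W = model.directions(3)            # 8 x 3 projection matrix
Z = (X - model.mean_x) @ W         # reduced 3-dimensional features
```

Because the running statistics are exact, the directions after the last sample match those of a batch computation on the centered data; only the storage of the scatter matrix would need to be replaced by a cheaper approximation to handle very high feature dimensions.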
