Online Incremental Feature Learning with Denoising Autoencoders

While determining model complexity is an important problem in machine learning, many feature learning algorithms rely on cross-validation to choose the optimal number of features, which is usually impractical for online learning from a massive stream of data. In this paper, we propose an incremental feature learning algorithm, based on the denoising autoencoder, that determines the optimal model complexity for large-scale, online datasets. The algorithm is composed of two processes: adding features and merging features. Specifically, it adds new features to minimize the residual of the objective function and merges similar features to obtain a compact feature representation and prevent over-fitting. Our experiments show that the proposed model quickly converges to the optimal number of features in a large-scale online setting. In classification tasks, our model outperforms the (non-incremental) denoising autoencoder, and deep networks constructed from our algorithm perform favorably compared to deep belief networks and stacked denoising autoencoders. Further, the algorithm is effective in recognizing new patterns when the data distribution changes over time in a massive online data stream.
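To make the two operations concrete, below is a minimal NumPy sketch of the add/merge idea on a tied-weight denoising autoencoder. This is an illustrative reconstruction, not the authors' code: the fixed add/merge schedule, the random initialization of new features, and the cosine-similarity merge rule are simplifying assumptions standing in for the residual-driven addition and similarity-based merging the abstract describes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class IncrementalDAE:
    """Denoising autoencoder whose hidden layer can grow and shrink online."""

    def __init__(self, n_visible, n_hidden, corruption=0.3, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.corruption = corruption
        self.lr = lr
        # Tied weights: W is shared by the encoder and (transposed) decoder.
        self.W = self.rng.normal(0, 0.01, size=(n_hidden, n_visible))
        self.b = np.zeros(n_hidden)   # hidden bias
        self.c = np.zeros(n_visible)  # visible bias

    def _corrupt(self, x):
        # Masking noise: randomly zero out a fraction of the inputs.
        return x * (self.rng.random(x.shape) > self.corruption)

    def train_step(self, x):
        # One SGD step on the squared reconstruction error of a minibatch.
        x_tilde = self._corrupt(x)
        h = sigmoid(x_tilde @ self.W.T + self.b)   # encode corrupted input
        z = sigmoid(h @ self.W + self.c)           # decode / reconstruct
        err = z - x
        dz = err * z * (1 - z)                     # grad at decoder pre-activation
        dh = (dz @ self.W.T) * h * (1 - h)         # grad at encoder pre-activation
        self.W -= self.lr * (dh.T @ x_tilde + h.T @ dz) / len(x)
        self.b -= self.lr * dh.mean(axis=0)
        self.c -= self.lr * dz.mean(axis=0)
        return float((err ** 2).sum(axis=1).mean())

    def add_features(self, n_new, hard_examples=None):
        # Grow the hidden layer. The paper trains new features to reduce the
        # objective's residual on hard examples; seeding new weights from the
        # mean of high-error inputs (if given) is a rough stand-in for that.
        W_new = self.rng.normal(0, 0.01, size=(n_new, self.W.shape[1]))
        if hard_examples is not None:
            W_new += 0.01 * hard_examples.mean(axis=0)
        self.W = np.vstack([self.W, W_new])
        self.b = np.concatenate([self.b, np.zeros(n_new)])

    def merge_most_similar(self):
        # Merge the pair of hidden units whose weight vectors are most
        # similar (cosine similarity), averaging them into a single feature.
        Wn = self.W / np.linalg.norm(self.W, axis=1, keepdims=True)
        sim = Wn @ Wn.T
        np.fill_diagonal(sim, -np.inf)
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        self.W[i] = (self.W[i] + self.W[j]) / 2
        self.b[i] = (self.b[i] + self.b[j]) / 2
        keep = np.arange(len(self.b)) != j
        self.W, self.b = self.W[keep], self.b[keep]

# Toy usage: stream minibatches, periodically adding and merging features.
# (The paper triggers growth from the residual; a fixed schedule is used
# here purely for illustration.)
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dae = IncrementalDAE(n_visible=64, n_hidden=8)
    for step in range(200):
        x = (rng.random((32, 64)) > 0.5).astype(float)  # fake binary data
        loss = dae.train_step(x)
        if step % 50 == 49:
            dae.add_features(4)        # grow the representation
            dae.merge_most_similar()   # keep it compact
    print("hidden units:", dae.W.shape[0], "final loss:", round(loss, 4))
```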
