From shallow feature learning to deep learning: Benefits from the width and depth of deep architectures

Since Pearson developed principal component analysis (PCA) in 1901, feature learning (also called representation learning) has been studied for more than 100 years. During this period, many “shallow” feature learning methods were proposed based on a variety of learning criteria and techniques, preceding the recent surge of deep learning research. In this advanced review, we trace the history of shallow feature learning research and introduce the important developments in deep learning models. In particular, we survey deep architectures that benefit from the optimization of their width and depth, as these models have set new records in many applications, such as image classification and object detection. Finally, several promising directions for deep learning research are presented and briefly discussed.
