Learning Algorithms from Data

Statistical machine learning is concerned with learning models that describe observations. We train such models from data on tasks like machine translation or object recognition because we cannot explicitly write down programs that solve these problems. A statistical model is useful only if it generalizes to unseen data. Solomonoff [114] proved that one should choose, among the models that agree with the observed data, the model that can be compressed the most, because this choice guarantees the best possible generalization. The length of the shortest description of a model is called its Kolmogorov complexity, and we define an algorithm as a function with small Kolmogorov complexity.

This Ph.D. thesis outlines the problem of learning algorithms from data and presents several partial solutions to it. Our models are mainly neural networks, as they have proven successful in various domains like object recognition [67, 109, 122], language modelling [90], speech recognition [48, 39], and others. First, we examine the empirical limits on the trainability of classical neural networks. Then, we extend these networks with interfaces that let the model read memory, access the input, and postpone its predictions; the model learns to use the interfaces with reinforcement learning techniques like REINFORCE and Q-learning. Next, we ex-
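The compression-based selection principle attributed to Solomonoff above is often stated as a two-part code-length criterion. The display below is a schematic restatement in our own notation, not a formula quoted from the thesis: K(M) denotes the Kolmogorov complexity of the model M, and L(D | M) the code length of the data D under the model.

```latex
% Two-part description-length criterion: among models M that agree with the
% data D, prefer the one with the shortest total description.
\[
  M^{*} \;=\; \arg\min_{M} \,\bigl[\, K(M) + L(D \mid M) \,\bigr]
\]
% K(M) is uncomputable, so practical methods substitute a computable proxy,
% e.g. the length of a compressed encoding of the model's parameters.
```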
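The interfaces mentioned above expose discrete actions (which memory cell to read, when to emit a prediction), so their use must be trained with score-function methods rather than backpropagation through the action choice. Below is a minimal sketch of the REINFORCE estimator on a toy three-armed bandit; the environment, learning rate, and moving-average baseline are illustrative assumptions, not the thesis's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])  # hypothetical expected reward per action

theta = np.zeros(3)   # parameters of a softmax policy over 3 actions
baseline = 0.0        # running average of rewards; reduces gradient variance
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)               # sample a discrete action
    reward = rng.normal(true_means[action], 0.1)  # noisy scalar reward

    # Gradient of log pi(action) for a softmax policy: one-hot(action) - probs.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # REINFORCE update: scale the score function by the centered reward.
    theta += lr * (reward - baseline) * grad_log_pi
    baseline += 0.05 * (reward - baseline)

print(softmax(theta))  # probability mass should concentrate on the best action
```

Q-learning, the other technique named in the abstract, would instead estimate a value per action and act greedily with exploration; REINFORCE is sketched here because its update applies unchanged when the actions are memory reads or postponed predictions.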

References

[1] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Wojciech Zaremba, et al. An Empirical Exploration of Recurrent Network Architectures, 2015, ICML.

[3] Lev Davidovich Landau, et al. What is the theory of relativity, 2003.

[4] Phil Blunsom, et al. Recurrent Continuous Translation Models, 2013, EMNLP.

[5] Geoffrey E. Hinton, et al. Acoustic Modeling Using Deep Belief Networks, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[6] Donald E. Knuth, et al. Semantics of context-free languages, 1968, Mathematical systems theory.

[7] Ck Cheng, et al. The Age of Big Data, 2015.

[8] Edgar N. Reyes, et al. Optimization using simulated annealing, 1998, Northcon/98. Conference Proceedings (Cat. No.98CH36264).

[9] Salvatore J. Stolfo, et al. Experiments on multistrategy learning by meta-learning, 1993, CIKM '93.

[10] Mark Crovella, et al. Graph wavelets for spatial traffic analysis, 2003, IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No.03CH37428).

[11] Leonid A. Levin, et al. Randomness Conservation Inequalities; Information and Independence in Mathematical Theories, 1984, Inf. Control.

[12] Quoc V. Le, et al. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, 2011, CVPR 2011.

[13] Kablan Barbar, et al. Using attribute grammars to find solutions for musical equational programs, 1994, SIGP.

[14] Alex Graves, et al. Recurrent Models of Visual Attention, 2014, NIPS.

[15] Ilya Sutskever, et al. Learning Recurrent Neural Networks with Hessian-Free Optimization, 2011, ICML.

[16] Mark Wineberg, et al. A Representation Scheme To Perform Program Induction in a Canonical Genetic Algorithm, 1994, PPSN.

[17] Yann LeCun, et al. Convolutional Learning of Spatio-temporal Features, 2010, ECCV.

[18] Surya Ganguli, et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013, ICLR.

[19] Yann LeCun, et al. Convolutional neural networks applied to house numbers digit classification, 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[20] Samy Bengio, et al. Show and tell: A neural image caption generator, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Xiang Zhang, et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, 2013, ICLR.

[22] Joan Bruna, et al. Spectral Networks and Locally Connected Networks on Graphs, 2013, ICLR.

[23] Xindong Wu, et al. Data mining with big data, 2014, IEEE Transactions on Knowledge and Data Engineering.

[24] Geoffrey E. Hinton, et al. Generating Text with Recurrent Neural Networks, 2011, ICML.

[25] Jürgen Schmidhuber, et al. Framewise phoneme classification with bidirectional LSTM and other neural network architectures, 2005, Neural Networks.

[26] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[27] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.

[28] Bogdan M. Wilamowski, et al. Solving parity-N problems with feedforward neural networks, 2003, Proceedings of the International Joint Conference on Neural Networks, 2003.

[29] Nitish Srivastava, et al. Improving neural networks by preventing co-adaptation of feature detectors, 2012, ArXiv.

[30] Peter Schröder, et al. Multiresolution signal processing for meshes, 1999, SIGGRAPH.

[31] Samuel R. Bowman. Can recursive neural tensor networks learn logical reasoning?, 2014, ICLR.

[32] Ming Li, et al. Minimum description length induction, Bayesianism, and Kolmogorov complexity, 1999, IEEE Trans. Inf. Theory.

[33] Michael Sipser, et al. Borel sets and circuit complexity, 1983, STOC.

[34] Adam Jacobs, et al. The pathologies of big data, 2009, Commun. ACM.

[35] Lukás Burget, et al. Recurrent neural network based language model, 2010, INTERSPEECH.

[36] Hugh G. Gauch, et al. Scientific method in practice, 2002.

[37] Jonathan Baxter, et al. Scaling Internal-State Policy-Gradient Methods for POMDPs, 2002.

[38] Yann LeCun, et al. Emergence of Complex-Like Cells in a Temporal Product Network with Local Receptive Fields, 2010, ArXiv.

[39] Fan Chung, et al. Spectral Graph Theory, 1996.

[40] Robert Tibshirani, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, 2001, Springer Series in Statistics.

[41] P. Simon. Too Big to Ignore: The Business Case for Big Data, 2013.

[42] Achi Brandt, et al. Efficient Multilevel Eigensolvers with Applications to Data Analysis Tasks, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43] Peter Stone, et al. Policy gradient reinforcement learning for fast quadrupedal locomotion, 2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04, 2004.

[44] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[45] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[46] T. Hill. A Statistical Derivation of the Significant-Digit Law, 1995.

[47] Yoshua Bengio, et al. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, 2014, SSST@EMNLP.

[48] Yoshua Bengio, et al. Learning long-term dependencies with gradient descent is difficult, 1994, IEEE Trans. Neural Networks.

[49] Inderjit S. Dhillon, et al. Weighted Graph Cuts without Eigenvectors: A Multilevel Approach, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50] Ronald R. Coifman, et al. Multiscale Wavelets on Trees, Graphs and High Dimensional Data: Theory and Applications to Semi Supervised Learning, 2010, ICML.

[51] Dan Klein, et al. Learning Dependency-Based Compositional Semantics, 2011, CL.

[52] Yoshua Bengio, et al. The problem of learning long-term dependencies in recurrent networks, 1993, IEEE International Conference on Neural Networks.

[53] Steven McCanne, et al. An attribute grammar based framework for machine-dependent computational optimization of media processing algorithms, 1999, Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348).

[54] Ben Taskar, et al. Learning structured prediction models: a large margin approach, 2005, ICML.

[55] Douglas Aberdeen, et al. Scalable Internal-State Policy-Gradient Methods for POMDPs, 2002, ICML.

[56] Wei Xu, et al. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN), 2014, ICLR.

[57] Wojciech Zaremba, et al. Learning to Discover Efficient Mathematical Identities, 2014, NIPS.

[58] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[59] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[60] Leonidas J. Guibas, et al. Wavelets on Graphs via Deep Learning, 2013, NIPS.

[61] Quoc V. Le, et al. Tiled convolutional neural networks, 2010, NIPS.

[62] Walter L. Ruzzo. On Uniform Circuit Complexity, 1981, J. Comput. Syst. Sci.

[63] Ronald J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 2004, Machine Learning.

[64] James Martens, et al. Deep learning via Hessian-free optimization, 2010, ICML.

[65] Stefan Schaal, et al. Policy Gradient Methods for Robotics, 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[66] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[67] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[68] Tara N. Sainath, et al. Fundamental Technologies in Modern Speech Recognition, 2012, DOI 10.1109/MSP.2012.2205597.

[69] Qiang Yang, et al. A Survey on Transfer Learning, 2010, IEEE Transactions on Knowledge and Data Engineering.

[70] Jason Weston, et al. Curriculum learning, 2009, ICML '09.

[71] Marcin Andrychowicz, et al. Learning Efficient Algorithms with Hierarchical Attentive Memory, 2016, ArXiv.

[72] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[73] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen [Investigations of dynamic neural networks], 1991.

[74] John Langford, et al. Search-based structured prediction, 2009, Machine Learning.

[75] Thomas G. Dietterich, et al. Structured machine learning: the next ten years, 2008, Machine Learning.

[76] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[77] Andrew Y. Ng, et al. Selecting Receptive Fields in Deep Networks, 2011, NIPS.

[78] Jason Weston, et al. A unified architecture for natural language processing: deep neural networks with multitask learning, 2008, ICML '08.

[79] Jean-Marc Steyaert, et al. An approximate matching algorithm for finding (sub-)optimal sequences in S-attributed grammars, 2002, ECCB.

[80] Joshua B. Tenenbaum, et al. Church: a language for generative models, 2008, UAI.

[81] Nicolas Le Roux, et al. Learning the 2-D Topology of Images, 2007, NIPS.

[82] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[83] Roman Smolensky, et al. Algebraic methods in the theory of lower bounds for Boolean circuit complexity, 1987, STOC.

[84] Andrew G. Howard, et al. Some Improvements on Deep Convolutional Neural Network Based Image Classification, 2013, ICLR.

[85] Thorsten Joachims, et al. Cutting-plane training of structural SVMs, 2009, Machine Learning.

[86] Mikhail Belkin, et al. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering, 2001, NIPS.

[87] Dirk P. Kroese, et al. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning, 2004.

[88] Jürgen Schmidhuber, et al. Optimal Ordered Problem Solver, 2002, Machine Learning.

[89] Ulrike von Luxburg, et al. A tutorial on spectral clustering, 2007, Stat. Comput.

[90] Ray J. Solomonoff, et al. A Formal Theory of Inductive Inference. Part II, 1964, Inf. Control.

[91] Koray Kavukcuoglu, et al. Multiple Object Recognition with Visual Attention, 2014, ICLR.

[92] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[93] Melanie Mitchell, et al. An introduction to genetic algorithms, 1996.