Understanding Machine Learning: From Theory to Algorithms

Machine learning is one of the fastest-growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics, and engineering.
