Paper information - Understanding Machine Learning: From Theory to Algorithms

Understanding Machine Learning: From Theory to Algorithms

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics, and engineering.

[1] Karl Pearson. On lines and planes of closest fit to systems of points in space , 1901 .

[2] J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique , 1902 .

[3] R. Fisher,et al. On the Mathematical Foundations of Theoretical Statistics , 1922 .

[4] J. Neumann. Zur Theorie der Gesellschaftsspiele , 1928 .

[5] A. Tikhonov. On the stability of inverse problems , 1943 .

[6] N. G. de Bruijn. A combinatorial problem , 1946 .

[7] John von Neumann,et al. A Certain Zero-sum Two-person Game Equivalent to the Optimal Assignment Problem , 1953 .

[8] S. Agmon. The Relaxation Method for Linear Inequalities , 1954, Canadian Journal of Mathematics.

[9] H. Kuhn. The Hungarian method for the assignment problem , 1955 .

[10] Philip Wolfe,et al. An algorithm for quadratic programming , 1956 .

[11] F. Rosenblatt,et al. The perceptron: a probabilistic model for information storage and organization in the brain , 1958, Psychological Review.

[12] David L. Phillips,et al. A Technique for the Numerical Solution of Certain Integral Equations of the First Kind , 1962, JACM.

[13] Albert B. Novikoff,et al. On Convergence Proofs for Perceptrons , 1963 .

[14] M. Aizerman,et al. Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning , 1964 .

[15] Thomas M. Cover,et al. Behavior of sequential predictors of binary sequences , 1965 .

[16] John Garcia,et al. Relation of cue to consequence in avoidance learning , 1966 .

[17] Peter E. Hart,et al. Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[18] Marvin Minsky,et al. Perceptrons: An Introduction to Computational Geometry , 1969 .

[19] Vladimir Vapnik,et al. On the uniform convergence of relative frequencies of events to their probabilities , 1971 .

[20] Norbert Sauer,et al. On the Density of Families of Sets , 1972, J. Comb. Theory, Ser. A.

[21] Richard M. Karp,et al. Reducibility Among Combinatorial Problems , 1972, 50 Years of Integer Programming.

[22] S. Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages , 1972 .

[23] Ronald L. Rivest,et al. Constructing Optimal Binary Decision Trees is NP-Complete , 1976, Inf. Process. Lett..

[24] C. J. Stone,et al. Consistent Nonparametric Regression , 1977 .

[25] E. Slud. Distribution Inequalities for the Binomial Law , 1977 .

[26] W. Rogers,et al. A Finite Sample Distribution-Free Performance Bound for Local Discrimination Rules , 1978 .

[27] J. Rissanen,et al. Modeling By Shortest Data Description , 1978, Autom..

[28] G. Pisier. Remarques sur un résultat non publié de B. Maurey , 1981 .

[29] John Darzentas,et al. Problem Complexity and Method Efficiency in Optimization , 1983 .

[30] J. Rissanen. A Universal Prior for Integers and Estimation by Minimum Description Length , 1983 .

[31] Leslie G. Valiant,et al. A theory of the learnable , 1984, STOC '84.

[32] Luc Devroye,et al. Nonparametric Density Estimation , 1985 .

[33] Geoffrey E. Hinton,et al. Learning internal representations by error propagation , 1986 .

[34] L. Rabiner,et al. An introduction to hidden Markov models , 1986, IEEE ASSP Magazine.

[35] David Haussler,et al. Occam's Razor , 1987, Inf. Process. Lett..

[36] N. Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm , 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[37] R. Dudley. Universal Donsker Classes and Metric Entropy , 1987 .

[38] Leslie G. Valiant,et al. Computational limitations on learning from examples , 1988, JACM.

[39] S. Smale,et al. On a theory of computation and complexity over the real numbers; NP-completeness , 1989 .

[40] Sally Floyd,et al. Space-bounded learning and the Vapnik-Chervonenkis dimension , 1989, COLT '89.

[41] David Haussler,et al. Learnability and the Vapnik-Chervonenkis dimension , 1989, JACM.

[42] Vladimir Vovk,et al. Aggregating strategies , 1990, COLT '90.

[43] Vladimir Vapnik,et al. Principles of Risk Minimization for Learning Theory , 1991, NIPS.

[44] R. Dudley,et al. Uniform and universal Glivenko-Cantelli classes , 1991 .

[45] Bernhard E. Boser,et al. A training algorithm for optimal margin classifiers , 1992, COLT '92.

[46] J. Ross Quinlan,et al. C4.5: Programs for Machine Learning , 1992 .

[47] Linda Sellie,et al. Toward efficient agnostic learning , 1992, COLT '92.

[48] David Haussler,et al. Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications , 1992, Inf. Comput..

[49] J. Hiriart-Urruty,et al. Convex analysis and minimization algorithms , 1993 .

[50] Stéphane Mallat,et al. Matching pursuits with time-frequency dictionaries , 1993, IEEE Trans. Signal Process..

[51] Jayaram K. Sankaran. A note on resolving infeasibility in linear programs by constraint relaxation , 1993, Oper. Res. Lett..

[52] Ian Parberry,et al. Circuit complexity and neural networks , 1994 .

[53] Manfred K. Warmuth,et al. The Weighted Majority Algorithm , 1994, Inf. Comput..

[54] Umesh V. Vazirani,et al. An Introduction to Computational Learning Theory , 1994 .

[55] Philip M. Long,et al. Fat-shattering and the learnability of real-valued functions , 1994, COLT '94.

[56] Dimitri P. Bertsekas,et al. Nonlinear Programming , 1997 .

[57] Philip M. Long,et al. A Generalization of Sauer's Lemma , 1995, J. Comb. Theory, Ser. A.

[58] Vladimir Vapnik,et al. The Nature of Statistical Learning Theory , 1995 .

[59] Philip M. Long,et al. Characterizations of Learnability for Classes of {0, ..., n}-Valued Functions , 1995, J. Comput. Syst. Sci..

[60] Yoav Freund,et al. A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[61] Thomas G. Dietterich,et al. Solving Multiclass Learning Problems via Error-Correcting Output Codes , 1994, J. Artif. Intell. Res..

[62] Balas K. Natarajan,et al. Sparse Approximate Solutions to Linear Systems , 1995, SIAM J. Comput..

[63] Yishay Mansour,et al. On the boosting ability of top-down decision tree learning algorithms , 1996, STOC '96.

[64] Michael Sipser,et al. Introduction to the Theory of Computation , 1996, SIGA.

[65] L. Houck,et al. Foundations of Animal Behavior: Classic Papers with Commentaries , 1996 .

[66] László Györfi,et al. A Probabilistic Theory of Pattern Recognition , 1996, Stochastic Modelling and Applied Probability.

[68] R. Tibshirani. Regression Shrinkage and Selection via the Lasso , 1996 .

[69] Leo Breiman,et al. Bias, Variance, and Arcing Classifiers , 1996 .

[70] S. Mallat,et al. Adaptive greedy approximations , 1997 .

[71] Yoav Freund,et al. A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[72] David H. Wolpert,et al. No free lunch theorems for optimization , 1997, IEEE Trans. Evol. Comput..

[73] Noga Alon,et al. Scale-sensitive dimensions, uniform convergence, and learnability , 1997, JACM.

[74] Vladimir Cherkassky,et al. The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[75] David A. McAllester. Some PAC-Bayesian Theorems , 1998, COLT '98.

[76] Vladimir Vapnik,et al. Statistical learning theory , 1998 .

[77] Bernhard Schölkopf,et al. Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[78] Yoshua Bengio,et al. Convolutional networks for images, speech, and time series , 1998 .

[79] Yoav Freund,et al. Large Margin Classification Using the Perceptron Algorithm , 1998, COLT.

[80] S. Ben-David,et al. Combinatorial Variability of Vapnik-Chervonenkis Classes with Applications to Sample Compression Schemes , 1998, Discrete Applied Mathematics.

[81] Noboru Murata,et al. A Statistical Study on On-line Learning , 1999 .

[82] Jason Weston,et al. Support vector machines for multi-class pattern recognition , 1999, ESANN.

[83] David A. McAllester. PAC-Bayesian model averaging , 1999, COLT '99.

[84] Peter L. Bartlett,et al. Neural Network Learning - Theoretical Foundations , 1999 .

[85] Geoffrey J. Gordon. Regret bounds for prediction problems , 1999, COLT '99.

[86] M. Kearns,et al. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation , 1999 .

[87] Nello Cristianini,et al. An introduction to Support Vector Machines , 2000 .

[88] Adrian S. Lewis,et al. Convex Analysis And Nonlinear Optimization , 2000 .

[89] Hans Ulrich Simon,et al. Efficient Learning of Linear Perceptrons , 2000, NIPS.

[90] V. Koltchinskii,et al. Rademacher Processes and Bounding the Risk of Function Learning , 2004, math/0405338.

[91] Vladimir N. Vapnik,et al. The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[92] Naftali Tishby,et al. The information bottleneck method , 2000, ArXiv.

[93] Yoram Singer,et al. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[94] Peter L. Bartlett,et al. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results , 2003, J. Mach. Learn. Res..

[95] Bernhard Schölkopf,et al. A Generalized Representer Theorem , 2001, COLT/EuroCOLT.

[96] Trevor Hastie,et al. The Elements of Statistical Learning , 2001 .

[97] David J. Kriegman,et al. From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[98] Alexander J. Smola,et al. Learning with Kernels: support vector machines, regularization, optimization, and beyond , 2001, Adaptive computation and machine learning series.

[99] Koby Crammer,et al. On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[100] Andrew McCallum,et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[101] Bernhard Schölkopf,et al. Kernel Dependency Estimation , 2002, NIPS.

[102] Matthias W. Seeger,et al. PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification , 2003, J. Mach. Learn. Res..

[103] Partha Niyogi,et al. Almost-everywhere Algorithmic Stability and Generalization Error , 2002, UAI.

[104] André Elisseeff,et al. Stability and Generalization , 2002, J. Mach. Learn. Res..

[105] John Shawe-Taylor,et al. PAC-Bayes & Margins , 2002, NIPS.

[106] Michael Collins,et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[107] Jon M. Kleinberg,et al. An Impossibility Theorem for Clustering , 2002, NIPS.

[108] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms , 2002 .

[109] P. Bartlett,et al. Hardness results for neural network approximation problems , 1999, Theor. Comput. Sci..

[110] Ben Taskar,et al. Max-Margin Markov Networks , 2003, NIPS.

[111] Manfred K. Warmuth,et al. Relating Data Compression and Learnability , 2003 .

[112] Yann LeCun,et al. Large Scale Online Learning , 2003, NIPS.

[113] David A. McAllester. Simplified PAC-Bayesian Margin Bounds , 2003, COLT.

[114] Isabelle Guyon,et al. An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[115] Martin Zinkevich,et al. Online Convex Programming and Generalized Infinitesimal Gradient Ascent , 2003, ICML.

[116] Shai Ben-David,et al. On the difficulty of approximately maximizing agreements , 2000, J. Comput. Syst. Sci..

[117] Manfred K. Warmuth,et al. Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension , 1995, Machine Learning.

[118] Tong Zhang,et al. Solving large scale linear prediction problems using stochastic gradient descent algorithms , 2004, ICML.

[119] Claudio Gentile,et al. The Robustness of the p-Norm Algorithms , 2003, Machine Learning.

[120] Leo Breiman,et al. Random Forests , 2001, Machine Learning.

[121] Corinna Cortes,et al. Support-Vector Networks , 1995, Machine Learning.

[122] Thomas Hofmann,et al. Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[123] Carla E. Brodley,et al. Proceedings of the twenty-first international conference on Machine learning , 2004, International Conference on Machine Learning.

[124] J. Ross Quinlan,et al. Induction of Decision Trees , 1986, Machine Learning.

[125] Yurii Nesterov,et al. Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[126] O. Bousquet. Theory of Classification: A Survey of Recent Advances , 2004 .

[127] David J. C. MacKay,et al. Information Theory, Inference, and Learning Algorithms , 2004, IEEE Transactions on Information Theory.

[128] Trevor Darrell,et al. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing) , 2006 .

[129] Yuhong Yang,et al. Information Theory, Inference, and Learning Algorithms , 2005 .

[130] R. Schapire. The Strength of Weak Learnability , 1990, Machine Learning.

[131] Michael Collins,et al. Discriminative Reranking for Natural Language Parsing , 2000, CL.

[132] P. Bartlett,et al. Local Rademacher complexities , 2005, math/0508275.

[133] Emmanuel J. Candès,et al. Decoding by linear programming , 2005, IEEE Transactions on Information Theory.

[134] B. K. Natarajan. On Learning Sets and Functions , 1989, Machine Learning.

[135] J. Langford. Tutorial on Practical Prediction Theory for Classification , 2005, J. Mach. Learn. Res..

[136] S. Boucheron,et al. Theory of classification: a survey of some recent advances , 2005 .

[137] T. Poggio,et al. Stability Results in Learning Theory , 2005 .

[138] Dan Roth,et al. Learnability of Bipartite Ranking Functions , 2005, COLT.

[139] Thorsten Joachims,et al. A support vector method for multivariate performance measures , 2005, ICML.

[140] Sayan Mukherjee,et al. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization , 2006, Adv. Comput. Math..

[141] V. Vapnik. Estimation of Dependences Based on Empirical Data , 2006 .

[142] Stephen P. Boyd,et al. Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[143] Yee Whye Teh,et al. A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[144] Gábor Lugosi,et al. Prediction, learning, and games , 2006 .

[145] Alexander A. Sherstov,et al. Cryptographic Hardness for Learning Intersections of Halfspaces , 2006, FOCS.

[147] Gunnar Rätsch,et al. Totally corrective boosting algorithms that maximize the margin , 2006, ICML.

[148] Peng Zhao,et al. On Model Selection Consistency of Lasso , 2006, J. Mach. Learn. Res..

[149] Alexander A. Sherstov,et al. Cryptographic Hardness for Learning Intersections of Halfspaces , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[150] O. Chapelle. Large margin optimization of ranking measures , 2007 .

[151] H. Robbins. A Stochastic Approximation Method , 1951 .

[152] Léon Bottou,et al. The Tradeoffs of Large Scale Learning , 2007, NIPS.

[153] Nasser M. Nasrabadi,et al. Pattern Recognition and Machine Learning , 2006, Technometrics.

[154] Lior Rokach,et al. Data Mining with Decision Trees - Theory and Applications , 2007, Series in Machine Perception and Artificial Intelligence.

[155] Yoshua Bengio,et al. Scaling learning algorithms towards AI , 2007 .

[156] Yoram Singer,et al. Pegasos: primal estimated sub-gradient solver for SVM , 2007, ICML '07.

[157] A. Beygelzimer. Multiclass Classification with Filter Trees , 2007 .

[158] Elad Hazan,et al. Logarithmic regret algorithms for online convex optimization , 2006, Machine Learning.

[159] Marc'Aurelio Ranzato,et al. Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[160] Yoshua Bengio,et al. Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[161] Ulrike von Luxburg,et al. A tutorial on spectral clustering , 2007, Stat. Comput..

[162] Shai Shalev-Shwartz,et al. Online learning: theory, algorithms and applications , 2007 .

[163] Andreas Christmann,et al. Support vector machines , 2008, Data Mining and Knowledge Discovery Handbook.

[164] S. V. N. Vishwanathan,et al. Entropy Regularized LPBoost , 2008, ALT.

[165] Ambuj Tewari,et al. Optimal Strategies and Minimax Lower Bounds for Online Convex Games , 2008, COLT.

[166] Jason Weston,et al. A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[167] Chih-Jen Lin,et al. A Practical Guide to Support Vector Classification , 2008 .

[168] Nathan Srebro,et al. SVM optimization: inverse dependence on training set size , 2008, ICML '08.

[169] R. DeVore,et al. A Simple Proof of the Restricted Isometry Property for Random Matrices , 2008 .

[170] Ambuj Tewari,et al. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization , 2008, NIPS.

[171] E. Candès. The restricted isometry property and its implications for compressed sensing , 2008 .

[172] Shai Ben-David,et al. Measures of Clustering Quality: A Working Set of Axioms for Clustering , 2008, NIPS.

[173] William W. Cohen,et al. Proceedings of the 23rd international conference on Machine learning , 2006, ICML 2008.

[174] Shai Ben-David,et al. Agnostic Online Learning , 2009, COLT.

[175] Honglak Lee,et al. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[176] William T. Freeman,et al. Informative Sensing , 2009, ArXiv.

[177] Yurii Nesterov,et al. Primal-dual subgradient methods for convex problems , 2005, Math. Program..

[178] Alexander Shapiro,et al. Stochastic Approximation approach to Stochastic Programming , 2013 .

[179] Nir Friedman,et al. Probabilistic Graphical Models - Principles and Techniques , 2009 .

[180] Ohad Shamir,et al. Stochastic Convex Optimization , 2009, COLT.

[181] Alexander Shapiro,et al. Lectures on Stochastic Programming: Modeling and Theory , 2009 .

[182] Ohad Shamir,et al. Learning Kernel-Based Halfspaces with the Zero-One Loss , 2010, COLT 2010.

[183] Yoram Singer,et al. On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms , 2010, Machine Learning.

[184] Tong Zhang,et al. Trading Accuracy for Sparsity in Optimization Problems with Sparsity Constraints , 2010, SIAM J. Optim..

[185] Ambuj Tewari,et al. Online Learning: Random Averages, Combinatorial Parameters, and Learnability , 2010, NIPS.

[186] Ohad Shamir,et al. Learnability, Stability and Uniform Convergence , 2010, J. Mach. Learn. Res..

[187] Wei-Yin Loh,et al. Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[188] Andrea Montanari,et al. The Noise-Sensitivity Phase Transition in Compressed Sensing , 2010, IEEE Transactions on Information Theory.

[189] Pedro M. Domingos,et al. Sum-product networks: A new deep architecture , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[190] Shai Ben-David,et al. Multiclass Learnability and the ERM principle , 2011, COLT.

[191] Kevin P. Murphy,et al. Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[192] David Barber,et al. Bayesian reasoning and machine learning , 2012 .

[193] Yoav Freund,et al. Boosting: Foundations and Algorithms , 2012 .

[194] Shai Shalev-Shwartz,et al. Online Learning and Online Convex Optimization , 2012, Found. Trends Mach. Learn..

[195] Amit Daniely,et al. Multiclass Learning Approaches: A Theoretical Comparison with Implications , 2012, NIPS.

[196] Ohad Shamir,et al. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization , 2011, ICML.

[197] Roi Livni,et al. Honest Compressions and Their Application to Compression Schemes , 2013, COLT.

[198] Ohad Shamir,et al. Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes , 2012, ICML.

[199] Roi Livni,et al. A Provably Efficient Algorithm for Training Deep Networks , 2013, ArXiv.

[200] Marc'Aurelio Ranzato,et al. Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[201] D. L. Donoho,et al. Compressed sensing , 2006, IEEE Trans. Inf. Theory.

[202] Lee-Ad Gottlieb,et al. Efficient Classification for Metric Data , 2014, IEEE Trans. Inf. Theory.

[203] Andreas Holzinger,et al. Data Mining with Decision Trees: Theory and Applications , 2015, Online Inf. Rev..