Introduction to Statistical Machine Learning

Machine learning allows computers to learn and discern patterns without being explicitly programmed. Combined with statistical techniques, machine learning becomes a powerful tool for analysing many kinds of data across computer science and engineering, including image processing, speech processing, natural language processing, and robot control, as well as in fundamental sciences such as biology, medicine, astronomy, physics, and materials science. Introduction to Statistical Machine Learning provides a general introduction to machine learning that covers a wide range of topics concisely and helps you bridge the gap between theory and practice. Part I discusses the fundamental concepts of statistics and probability that are used in describing machine learning algorithms. Parts II and III explain the two major approaches of machine learning: generative methods and discriminative methods. Part IV provides an in-depth look at advanced topics that play essential roles in making machine learning algorithms more useful in practice. The accompanying MATLAB/Octave programs provide you with the practical skills needed to accomplish a wide range of data analysis tasks. The book:

- Provides the background material needed to understand machine learning, such as statistics, probability, linear algebra, and calculus.
- Offers complete coverage of the generative approach to statistical pattern recognition and the discriminative approach to statistical machine learning.
- Includes MATLAB/Octave programs so that readers can test the algorithms numerically and acquire both mathematical and practical skills in a wide range of data analysis tasks.
- Discusses a wide range of applications in machine learning and statistics, with examples drawn from image processing, speech processing, natural language processing, robot control, biology, medicine, astronomy, physics, and materials science.
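To make the generative/discriminative distinction concrete, here is a minimal MATLAB/Octave sketch. It is illustrative only and not one of the book's accompanying programs; the toy data and all variable names are assumptions made for this example. It fits both kinds of classifier to the same two-class data: a shared-covariance Gaussian model classified through Bayes' rule (generative), and logistic regression trained by gradient ascent (discriminative).

```matlab
% Toy data: two Gaussian blobs in 2-D (assumed for illustration).
n = 100;                                  % samples per class
X = [randn(n, 2) + 2; randn(n, 2) - 2];   % features
y = [ones(n, 1); zeros(n, 1)];            % class labels (1 and 0)

% Generative approach: model p(x|y) with class-wise Gaussians sharing a
% covariance matrix, then classify through Bayes' rule.
mu1 = mean(X(y == 1, :));                 % class-conditional means
mu0 = mean(X(y == 0, :));
S = (cov(X(y == 1, :)) + cov(X(y == 0, :))) / 2;   % pooled covariance
prior1 = mean(y);                         % class prior p(y = 1)
w_gen = S \ (mu1 - mu0)';                 % linear decision boundary
b_gen = -(mu1 * (S \ mu1') - mu0 * (S \ mu0')) / 2 + log(prior1 / (1 - prior1));
pred_gen = (X * w_gen + b_gen) > 0;       % posterior log-odds > 0

% Discriminative approach: model p(y|x) directly with logistic regression,
% trained by plain gradient ascent on the log-likelihood.
Xb = [X, ones(2 * n, 1)];                 % append a bias feature
w = zeros(3, 1);
for t = 1:2000
  p = 1 ./ (1 + exp(-Xb * w));            % predicted p(y = 1 | x)
  w = w + 0.1 * Xb' * (y - p) / (2 * n);  % gradient of the log-likelihood
end
pred_dis = (Xb * w) > 0;

fprintf('generative accuracy:     %.3f\n', mean(pred_gen == y));
fprintf('discriminative accuracy: %.3f\n', mean(pred_dis == y));
```

On such well-separated data the two classifiers reach essentially the same linear boundary; what differs is what is modeled (the joint distribution of inputs and labels versus the conditional distribution of labels given inputs), which is the distinction the book's Parts II and III develop in depth.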
