Machine Learning with Squared-Loss Mutual Information

Mutual information (MI) is useful for detecting statistical independence between random variables, and it has been successfully applied to a variety of machine learning problems. Recently, an alternative to MI called squared-loss MI (SMI) was introduced. While ordinary MI is the Kullback–Leibler divergence from the joint distribution to the product of the marginal distributions, SMI is its Pearson divergence variant. Because both divergences belong to the f-divergence family, they share similar theoretical properties. A notable advantage of SMI, however, is that it can be approximated from data in a computationally more efficient and numerically more stable way than ordinary MI. In this article, we review recent developments in SMI approximation based on direct density-ratio estimation, as well as SMI-based machine learning techniques such as independence testing, dimensionality reduction, canonical dependency analysis, independent component analysis, object matching, clustering, and causal inference.
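Concretely, with p_xy the joint density and p_x, p_y the marginals, the two quantities compared above are

\mathrm{MI}(X;Y) = \iint p_{xy}(x,y)\,\log\frac{p_{xy}(x,y)}{p_x(x)\,p_y(y)}\,dx\,dy,

\mathrm{SMI}(X;Y) = \frac{1}{2}\iint p_x(x)\,p_y(y)\left(\frac{p_{xy}(x,y)}{p_x(x)\,p_y(y)} - 1\right)^{2} dx\,dy.

Both are non-negative and vanish if and only if X and Y are statistically independent, and both depend on the distributions only through the density ratio r(x,y) = p_xy(x,y) / (p_x(x) p_y(y)). This is why directly estimating the ratio yields an SMI approximator without separate density estimation of the joint and the marginals.

To illustrate the direct approach, the following is a minimal sketch of a least-squares SMI (LSMI-style) estimator in Python. The Gaussian-kernel ratio model, the basis centers taken from the first samples, and the fixed hyperparameters sigma and lam are illustrative assumptions; in practice they would be chosen by cross-validation, and the closing plug-in formula is one of several consistent output variants.

import numpy as np

def gaussian_kernel(a, b, sigma):
    # Pairwise Gaussian kernel values between the rows of a and b.
    d2 = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lsmi(x, y, n_basis=100, sigma=1.0, lam=1e-3):
    # x: (n, d_x) and y: (n, d_y) paired samples, one pair per row.
    n = x.shape[0]
    b = min(n_basis, n)

    # Kernel basis functions centred at the first b paired samples;
    # the ratio model is r(x, y) = sum_l theta_l Kx(x, x_l) Ky(y, y_l).
    Kx = gaussian_kernel(x, x[:b], sigma)  # shape (n, b)
    Ky = gaussian_kernel(y, y[:b], sigma)  # shape (n, b)

    # h_l: empirical average of the basis over the *joint* sample pairs.
    h = (Kx * Ky).mean(axis=0)

    # H_{ll'}: empirical average over the *product of marginals*,
    # obtained by pairing every x_i with every y_j.
    H = (Kx.T @ Kx) * (Ky.T @ Ky) / n ** 2

    # Regularized least squares for the density-ratio coefficients.
    theta = np.linalg.solve(H + lam * np.eye(b), h)

    # Plug-in SMI estimate: SMI = (1/2) E_joint[r] - 1/2.
    return 0.5 * float(h @ theta) - 0.5

For example, lsmi(x, y) stays near zero for independent samples and grows with their dependency, so thresholding it by permutation yields an independence test, and maximizing it over linear projections of x underlies the SMI-based dimensionality-reduction methods reviewed in the article.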
