Information Theoretic Learning and Kernel Methods

In this chapter, we discuss important connections between two different approaches to machine learning, namely Rényi entropy-based information theoretic learning and Mercer kernel methods. We show that Parzen windowing for the estimation of probability density functions reveals these connections: it allows the information theoretic criteria to be expressed in terms of mean vectors in a Mercer kernel feature space or, equivalently, in terms of kernel matrices. This shows that two previously separate paradigms in machine learning are related. It also lets us interpret and understand methods developed in one paradigm in terms of the other, and develop new machine learning algorithms that draw on both approaches.
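As an illustration of the connection described above, the following minimal sketch estimates Rényi's quadratic entropy with a Parzen window and checks numerically that the same quantity is the squared norm of a mean vector in a feature space built from the kernel matrix. The abstract does not specify an estimator, so several choices here are assumptions: a Gaussian Parzen window of width `sigma`, the quadratic (order-2) Rényi entropy, and feature vectors obtained from the eigendecomposition of the kernel matrix. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(x, sigma):
    """Kernel matrix of a Gaussian with variance 2*sigma**2.

    Convolving two Gaussian Parzen windows of width sigma gives a Gaussian
    of variance 2*sigma**2, which is why that variance appears here.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D array as N scalar samples
    d = x.shape[1]
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    var = 2.0 * sigma ** 2
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-sq_dists / (2.0 * var))

def renyi_quadratic_entropy(x, sigma):
    """Parzen-based estimate: -log of the mean of all kernel matrix entries."""
    K = gaussian_kernel_matrix(x, sigma)
    return -np.log(K.mean())

def feature_space_mean_norm_sq(x, sigma):
    """Squared norm of the mean feature vector, computed from K = E L E^T.

    The columns of phi = L^(1/2) E^T act as feature vectors whose inner
    products reproduce K (an assumed, finite-sample feature map).
    """
    K = gaussian_kernel_matrix(x, sigma)
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    phi = np.sqrt(eigvals)[:, None] * eigvecs.T
    m = phi.mean(axis=1)  # mean vector in the feature space
    return float(m @ m)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(50, 2))
    K = gaussian_kernel_matrix(samples, sigma=0.5)
    print("mean of kernel matrix:     ", K.mean())
    print("squared norm of mean vector:", feature_space_mean_norm_sq(samples, 0.5))
    print("quadratic entropy estimate: ", renyi_quadratic_entropy(samples, 0.5))
```

Under these assumptions, the first two printed values agree up to floating-point error, so the entropy estimate can be read either as a statistic of the kernel matrix or as a function of the feature-space mean vector.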
