Information Theory, Machine Learning, and Reproducing Kernel Hilbert Spaces

A common problem faced by many data-processing professionals is how best to extract the information contained in data. In our daily lives and in our professions, we are bombarded by huge amounts of data, but most often the data themselves are not our primary interest. Data hide, either in time structure or in spatial redundancy, important clues to the information-processing questions we pose. We use the term information in the colloquial sense, so it may mean different things to different people, which is acceptable for now. We all realize that computers and the Web have tremendously accelerated both the accessibility of data and the amount being generated. The pressure to distill information from data will therefore mount at an increasing pace, and old ways of dealing with this problem will be forced to evolve and adapt to the new reality. To many (including the author), this represents nothing less than a paradigm shift from hypothesis-based to evidence-based science, and it will affect the core design strategies in many disciplines, including learning theory and adaptive systems.
