From sample similarity to ensemble similarity: probabilistic distance measures in reproducing kernel Hilbert space

This paper addresses the problem of characterizing ensemble similarity from sample similarity in a principled manner. Using a reproducing kernel to characterize sample similarity, we propose a probabilistic distance measure in the reproducing kernel Hilbert space (RKHS) as the ensemble similarity. Assuming normality in the RKHS, we derive analytic expressions for probabilistic distance measures that are commonly used in many applications, such as the Chernoff distance (with the Bhattacharyya distance as a special case) and the Kullback-Leibler divergence. Since the reproducing kernel implicitly embeds a nonlinear mapping, our approach presents a new way to study these distances, whose feasibility and efficiency are demonstrated in experiments on synthetic and real examples. Further, we extend the ensemble similarity to a reproducing kernel for ensembles and study the ensemble similarity for more general data representations.
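To make the distances named above concrete, the following is a minimal sketch of the standard closed-form Bhattacharyya distance and Kullback-Leibler divergence between two Gaussian densities. It works in the ordinary input space with explicit means and covariances, so it is not the RKHS derivation of the paper; the function names and the use of NumPy are assumptions made for illustration only.

```python
import numpy as np


def _prep(mu, cov):
    # Coerce inputs to a float vector and a float square matrix.
    return np.atleast_1d(np.asarray(mu, dtype=float)), np.atleast_2d(np.asarray(cov, dtype=float))


def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2) (hypothetical helper)."""
    mu1, cov1 = _prep(mu1, cov1)
    mu2, cov2 = _prep(mu2, cov2)
    cov = 0.5 * (cov1 + cov2)               # averaged covariance
    diff = mu1 - mu2
    maha = diff @ np.linalg.solve(cov, diff)  # Mahalanobis-type term
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.125 * maha + 0.5 * (logdet - 0.5 * (logdet1 + logdet2))


def kl_gaussian(mu1, cov1, mu2, cov2):
    """KL divergence KL(N(mu1, cov1) || N(mu2, cov2)) (hypothetical helper)."""
    mu1, cov1 = _prep(mu1, cov1)
    mu2, cov2 = _prep(mu2, cov2)
    d = mu1.shape[0]
    diff = mu2 - mu1
    trace_term = np.trace(np.linalg.solve(cov2, cov1))  # tr(cov2^{-1} cov1)
    maha = diff @ np.linalg.solve(cov2, diff)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.5 * (trace_term + maha - d + logdet2 - logdet1)


if __name__ == "__main__":
    # Two unit-variance 1-D Gaussians whose means differ by 2.
    print(bhattacharyya_gaussian([0.0], [[1.0]], [2.0], [[1.0]]))
    print(kl_gaussian([0.0], [[1.0]], [2.0], [[1.0]]))
```

Both functions return zero when the two Gaussians coincide. The Bhattacharyya distance is symmetric in its arguments, whereas the Kullback-Leibler divergence is not.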
