Probability Product Kernels

The advantages of discriminative learning algorithms and kernel machines are combined with generative modeling using a novel kernel between distributions. In the probability product kernel, data points in the input space are mapped to distributions over the sample space and a general inner product is then evaluated as the integral of the product of pairs of distributions. The kernel is straightforward to evaluate for all exponential family models such as multinomials and Gaussians and yields interesting nonlinear kernels. Furthermore, the kernel is computable in closed form for latent distributions such as mixture models, hidden Markov models and linear dynamical systems. For intractable models, such as switching linear dynamical systems, structured mean-field approximations can be brought to bear on the kernel evaluation. For general distributions, even if an analytic expression for the kernel is not feasible, we show a straightforward sampling method to evaluate it. Thus, the kernel permits discriminative learning methods, including support vector machines, to exploit the properties, metrics and invariances of the generative models we infer from each datum. Experiments are shown using multinomial models for text, hidden Markov models for biological data sets and linear dynamical systems for time series data.
