Learning via Hilbert Space Embedding of Distributions

In this thesis, we propose a framework for learning based on Hilbert space embeddings of distributions. By embedding distributions via the kernel mean map, we can compare them by computing their distance in a reproducing kernel Hilbert space. We show that learning via the mean map enjoys both good generalization ability and finite-sample convergence. This distance between distributions allows us to tackle a wide range of learning problems under a common framework, and the new view often leads to simpler and more effective algorithms.

In particular, the thesis focuses on a measure of dependence derived from the mean map and shows that it applies to a broad range of learning problems, of which four concrete examples are treated in detail:

• Independence measurement and testing for structured and heterogeneous data.
• Feature selection via dependence in the supervised learning setting.
• Clustering via dependence with an additional metric on the labels.
• Dimensionality reduction via dependence with side information.

We also show that learning via Hilbert space embeddings and the associated dependence measure subsumes many existing algorithms as special cases. By elucidating the differences and connections among these algorithms, we provide useful guidelines for practitioners in a variety of applications. This embedding approach to distribution analysis offers a principled drop-in replacement for information-theoretic approaches, and we believe it will find wide application in the near future.
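As a concrete illustration (a minimal sketch, not code from the thesis), the snippet below computes the two empirical quantities the abstract alludes to: the kernel mean map distance between two samples, i.e. the biased empirical maximum mean discrepancy (MMD), and the kernel dependence measure between paired samples, i.e. the biased empirical Hilbert-Schmidt independence criterion (HSIC). The Gaussian kernel, the bandwidth, the normalisation convention, and the toy data are illustrative assumptions.

```python
# Minimal sketch of empirical MMD and HSIC with a Gaussian kernel.
# Kernel choice, bandwidth, and normalisation are illustrative assumptions.
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased empirical squared MMD between samples X ~ P and Y ~ Q:
    ||mu_P - mu_Q||^2 estimated by mean(Kxx) - 2 mean(Kxy) + mean(Kyy)."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC between paired samples (x_i, y_i):
    tr(K H L H) / (m - 1)^2, where H centres the Gram matrices.
    (Normalisation conventions vary across the literature.)"""
    m = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    L = gaussian_kernel(Y, Y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    Y = rng.normal(loc=1.0, size=(200, 2))          # shifted distribution
    print("MMD^2(X, Y):", mmd2(X, Y))
    Z = X[:, :1] + 0.1 * rng.normal(size=(200, 1))  # Z depends on X
    print("HSIC(X, Z):", hsic(X, Z))
```

When the two samples are drawn from the same distribution, the empirical MMD^2 is close to zero; likewise, HSIC is close to zero for independent pairs and grows with the strength of the dependence, which is what makes it usable as an objective for feature selection, clustering, and dimensionality reduction as described above.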
