A Conditional Entropy Minimization Criterion for Dimensionality Reduction and Multiple Kernel Learning

Reducing the dimensionality of high-dimensional data without losing its essential information is an important task in information processing. When class labels of training data are available, Fisher discriminant analysis (FDA) has been widely used. However, the optimality of FDA is guaranteed only under very restrictive, idealized conditions, and FDA is often observed not to provide a good classification surface for real problems. This letter treats the problem of supervised dimensionality reduction from the viewpoint of information theory and proposes a framework of dimensionality reduction based on class-conditional entropy minimization. The proposed linear dimensionality-reduction technique is validated both theoretically and experimentally. Then, through kernel Fisher discriminant analysis (KFDA), the multiple kernel learning problem is treated in the proposed framework, and a novel algorithm is proposed that iteratively optimizes the parameters of the classification function and the kernel combination coefficients. The algorithm is experimentally shown to be comparable to or to outperform KFDA on large-scale benchmark data sets, and to be comparable to other multiple kernel learning techniques on the yeast protein function annotation task.
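
To make the notion of class-conditional entropy concrete, the following is a minimal sketch and not the estimator used in the letter, which the abstract does not specify. It assumes each class is Gaussian along a one-dimensional projection, so that the entropy of the projected value given the class is the prior-weighted sum of the per-class Gaussian entropies, 0.5 log(2πe σ_c²). The function name `gaussian_conditional_entropy` and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np

def gaussian_conditional_entropy(z, y):
    """Plug-in estimate of H(Z | C) for 1-D projected values z and labels y.

    Assumption (not from the source): each class is Gaussian along the
    projection, so its differential entropy is 0.5 * log(2*pi*e*variance).
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y)
    h = 0.0
    for c in np.unique(y):
        zc = z[y == c]
        prior = zc.size / z.size      # empirical class prior p(c)
        var = zc.var() + 1e-12        # small constant avoids log(0)
        h += prior * 0.5 * np.log(2.0 * np.pi * np.e * var)
    return h

# Synthetic two-class data: tight within-class spread along axis 0,
# wide within-class spread along axis 1.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], [0.3, 3.0], size=(200, 2)),
    rng.normal([2.0, 0.0], [0.3, 3.0], size=(200, 2)),
])
y = np.repeat([0, 1], 200)

# Compare two unit-norm projection directions (the coordinate axes).
h_axis0 = gaussian_conditional_entropy(X[:, 0], y)
h_axis1 = gaussian_conditional_entropy(X[:, 1], y)
print("H(Z|C) along axis 0:", h_axis0)
print("H(Z|C) along axis 1:", h_axis1)
```

In this toy setting the direction with the smaller class-conditional entropy is the one along which each class is most concentrated. The projection directions are kept at unit norm because differential entropy can be lowered arbitrarily by shrinking the projection vector; any practical criterion needs a normalization of this kind.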
