On the relation between discriminant analysis and mutual information for supervised linear feature extraction

This paper provides a unifying view of three discriminant linear feature extraction methods: linear discriminant analysis, heteroscedastic discriminant analysis, and maximization of mutual information. We propose a model-independent reformulation of the criteria underlying these three methods that stresses their similarities and elucidates their differences. Based on assumptions about the probability distribution of the classification data, we obtain sufficient conditions under which two or more of the above criteria coincide. It is shown that these conditions also suffice for Bayes optimality of the criteria. Our approach yields an information-theoretic derivation of both linear discriminant analysis and heteroscedastic discriminant analysis. Finally, for linear discriminant analysis, we discuss its relation to multidimensional independent component analysis and derive suboptimality bounds based on information theory.
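As background for the criteria compared in the paper, the classical Fisher LDA criterion seeks a projection W maximizing tr((WᵀS_w W)⁻¹ WᵀS_b W), where S_w and S_b are the within-class and between-class scatter matrices. A minimal sketch of this standard formulation (not the paper's reformulation; the function name and interface are illustrative):

```python
import numpy as np

def lda_directions(X, y, d):
    """Return the top-d Fisher LDA directions.

    Maximizes tr((W^T Sw W)^{-1} W^T Sb W) by solving the
    generalized eigenproblem Sb w = lambda * Sw w.
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)          # overall mean
    p = X.shape[1]
    Sw = np.zeros((p, p))        # within-class scatter
    Sb = np.zeros((p, p))        # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # Solve Sw^{-1} Sb w = lambda w and keep the leading eigenvectors.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:d]]
```

For two classes separated along the first coordinate, the leading direction recovered by this criterion aligns with that coordinate axis, which is the behavior the homoscedastic-Gaussian assumptions in the paper make Bayes optimal.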
