Feature Extraction by Non-Parametric Mutual Information Maximization

We present a method for learning discriminative feature transforms using the mutual information between class labels and transformed features as the criterion. Instead of a commonly used mutual information measure based on the Kullback-Leibler divergence, we use a quadratic divergence measure, which admits an efficient non-parametric implementation and requires no prior assumptions about class densities. In addition to linear transforms, we also discuss nonlinear transforms implemented as radial basis function networks. Extensions that reduce the computational complexity are also presented, and a comparison to greedy feature selection is made.
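
To make the criterion concrete, the sketch below shows one way such a non-parametric estimate can be formed: a Parzen-window estimate of a quadratic divergence between the joint density of class labels and transformed features and the product of their marginals, maximized over a linear transform by gradient ascent. The function names, the isotropic Gaussian kernel width sigma2, the unit-norm column constraint, and the finite-difference gradient are illustrative assumptions, not the paper's exact implementation; an analytic gradient of the quadratic measure exists and is what would be used in practice, the finite-difference version is kept only for brevity.

```python
import numpy as np

def _gauss_gram(Y, sigma2):
    """Pairwise values G(y_i - y_j, 2*sigma2*I), which arise when two
    Gaussian Parzen kernels of variance sigma2 are integrated against
    each other (the reason the quadratic measure needs no numerical
    integration)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    dim = Y.shape[1]
    norm = (4.0 * np.pi * sigma2) ** (-dim / 2.0)
    return norm * np.exp(-d2 / (4.0 * sigma2))

def quadratic_mi(Y, labels, sigma2=1.0):
    """Parzen-window estimate of sum_c integral (p(c,y) - p(c)p(y))^2 dy,
    a quadratic divergence between the joint density and the product of
    marginals; it is zero iff labels and features are independent."""
    labels = np.asarray(labels)
    N = len(labels)
    G = _gauss_gram(Y, sigma2)
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / N
    # Within-class, all-pairs, and cross terms of the expanded square.
    v_in = sum(G[np.ix_(labels == c, labels == c)].sum() for c in classes) / N**2
    v_all = (priors ** 2).sum() * G.sum() / N**2
    v_btw = sum(p * G[labels == c, :].sum() for c, p in zip(classes, priors)) / N**2
    return v_in + v_all - 2.0 * v_btw

def learn_linear_transform(X, labels, out_dim, sigma2=1.0,
                           lr=0.5, steps=200, eps=1e-4, seed=0):
    """Gradient ascent on quadratic_mi(X @ W, labels) over W, using a
    finite-difference gradient for brevity (a sketch, not the paper's
    analytic-gradient procedure)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], out_dim))
    W /= np.linalg.norm(W, axis=0)
    for _ in range(steps):
        base = quadratic_mi(X @ W, labels, sigma2)
        grad = np.zeros_like(W)
        for idx in np.ndindex(*W.shape):
            W_p = W.copy()
            W_p[idx] += eps
            grad[idx] = (quadratic_mi(X @ W_p, labels, sigma2) - base) / eps
        W += lr * grad
        W /= np.linalg.norm(W, axis=0)  # keep the transform columns at unit norm
    return W
```

As a usage example, `W = learn_linear_transform(X, y, out_dim=2)` followed by `X @ W` projects the data onto two discriminative dimensions. The pairwise kernel sums make each evaluation O(N^2), which becomes the bottleneck beyond a few hundred samples and is presumably what the complexity-reducing extensions mentioned in the abstract are aimed at.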
