On feature extraction by mutual information maximization

We discuss mutual information between class labels and transformed features as a criterion for learning discriminative feature transforms. Instead of Shannon's definition, we use measures based on Rényi entropy, which lends itself to an efficient implementation and to an interpretation in terms of "information potentials" and "information forces" induced by the data samples. This paper presents two routes towards practical usability of the method, aimed especially at large databases: the first is an on-line stochastic gradient algorithm, and the second is based on approximating the class densities in the output space by Gaussian mixture models.
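As a rough illustration of the criterion and its stochastic-gradient maximization, the sketch below estimates a Parzen-window quadratic mutual information between class labels and linearly transformed features y = Wx on a mini-batch, and takes one gradient-ascent step on W. This is a minimal sketch under stated assumptions, not the paper's exact algorithm: the kernel width `sigma`, step size `lr`, batch sizes, and all function names are illustrative choices, and JAX autodifferentiation stands in for the analytically derived gradients.

```python
import jax
import jax.numpy as jnp
import numpy as np

def quadratic_mi(W, X, labels, sigma=1.0):
    """Parzen-window estimate of quadratic mutual information between y = W x and
    the class labels, written in terms of class-conditional kernel ("potential") sums."""
    Y = X @ W.T                                   # (N, d_out) transformed features
    N, d = Y.shape
    # Pairwise Gaussian kernel matrix G(y_i - y_j, 2*sigma^2 I)
    d2 = jnp.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = jnp.exp(-d2 / (4.0 * sigma ** 2)) / (4.0 * jnp.pi * sigma ** 2) ** (d / 2.0)
    ones = jnp.ones(N)
    v_in = v_all = v_btw = 0.0
    for c in np.unique(labels):                   # labels are concrete, so the loop is static
        m = jnp.asarray(labels == c, dtype=Y.dtype)   # indicator vector of class c
        p_c = m.sum() / N
        v_in  += m @ K @ m                        # within-class potential
        v_all += p_c ** 2 * (ones @ K @ ones)     # marginal potential
        v_btw += p_c * (m @ K @ ones)             # cross potential
    return (v_in + v_all - 2.0 * v_btw) / N ** 2

# One stochastic-gradient ascent step on a mini-batch, maximizing the estimate w.r.t. W
grad_mi = jax.grad(quadratic_mi)

def sgd_step(W, X_batch, y_batch, lr=0.1, sigma=1.0):
    return W + lr * grad_mi(W, X_batch, y_batch, sigma)

# Toy usage: random 2-class, 10-D data projected to 2-D
X = jax.random.normal(jax.random.PRNGKey(0), (64, 10))
y = np.array([0, 1] * 32)
W = 0.1 * jax.random.normal(jax.random.PRNGKey(1), (2, 10))
for _ in range(5):
    W = sgd_step(W, X, y, lr=0.5)
```

In this sketch the pairwise-kernel sums play the role of the information potentials mentioned above; their gradients with respect to the transformed samples are the information forces that the stochastic gradient propagates back to the transform W.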
