Noise-robust multi-stream fusion for text-independent speaker authentication

Multi-stream approaches have proven very successful in speech recognition tasks and, to a certain extent, in speaker authentication tasks. In this study we propose a noise-robust multi-stream system for text-independent speaker authentication. The system is built in two steps: first, the stream experts are trained under clean conditions; second, a combination mechanism is trained to merge the scores of the stream experts under both clean and noisy conditions. The idea is to exploit the rather predictable reliability and diversity of the streams under different conditions, so noise robustness comes mainly from the combination mechanism. This two-step approach offers several practical advantages: the stream experts can be trained in parallel (e.g., on several machines); heterogeneous feature types can be used; and the resulting system can be robust to different noise types (wide-band or narrow-band) as compared to sub-streams. An important finding is that a trade-off is often necessary between good overall performance across all conditions (clean and noisy) and good performance under clean conditions. To reconcile this trade-off, we propose giving more emphasis (a larger prior) to clean conditions. The resulting combination mechanism does not deteriorate under clean conditions (as compared to the best stream) yet remains robust under noisy conditions.
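
The second step described above (training a combination mechanism on expert scores from clean and noisy conditions, with extra emphasis on clean data) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the combiner is a weighted logistic regression fitted by gradient descent, the names `train_fusion`, `fuse`, and `clean_prior` are hypothetical, and the scores are synthetic. The abstract does not say which combination mechanism the authors actually use.

```python
import numpy as np


def train_fusion(scores, labels, is_clean, clean_prior=0.7, lr=0.5, n_iter=2000):
    """Fit a linear fusion of per-stream scores (illustrative sketch only).

    scores      : (n_samples, n_streams) scores from the stream experts
    labels      : (n_samples,) 1 for a client access, 0 for an impostor
    is_clean    : (n_samples,) True where the sample comes from clean speech
    clean_prior : assumed share of the total training weight given to clean data
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    is_clean = np.asarray(is_clean, dtype=bool)

    # Per-sample weights: clean samples share clean_prior, noisy ones the rest.
    n_clean = max(int(is_clean.sum()), 1)
    n_noisy = max(int((~is_clean).sum()), 1)
    w = np.where(is_clean, clean_prior / n_clean, (1.0 - clean_prior) / n_noisy)

    # Append a constant column so the last parameter acts as a bias.
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    theta = np.zeros(X.shape[1])

    # Plain gradient descent on the weighted logistic loss.
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta -= lr * (X.T @ (w * (p - labels)))
    return theta


def fuse(theta, scores):
    """Return the fused score (log-odds); accept the claim when it exceeds 0."""
    scores = np.asarray(scores, dtype=float)
    return scores @ theta[:-1] + theta[-1]


# Synthetic example: stream A is strong on clean data but weak in noise,
# stream B is moderately reliable under both conditions.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=2 * n)
clean = np.arange(2 * n) < n
sign = 2 * y - 1
stream_a = np.where(clean, 1.5, 0.2) * sign + rng.normal(size=2 * n)
stream_b = 1.0 * sign + rng.normal(size=2 * n)
S = np.column_stack([stream_a, stream_b])

theta = train_fusion(S, y, clean, clean_prior=0.7)
decisions = fuse(theta, S) > 0
print("fusion parameters (stream A, stream B, bias):", theta)
print("accuracy on clean samples:", np.mean(decisions[clean] == (y[clean] == 1)))
print("accuracy on noisy samples:", np.mean(decisions[~clean] == (y[~clean] == 1)))
```

Raising `clean_prior` toward 1 makes the combiner favour whatever works best on clean data, while lowering it shifts weight toward streams that stay reliable in noise. This is one simple way to picture the trade-off the abstract describes.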
