Channel selection measures for multi-microphone speech recognition

Automatic speech recognition in a room with distant microphones is strongly affected by noise and reverberation. When the speech signal is captured by several arbitrarily located microphones, the degree of distortion differs from one channel to another. In this work we deal with measures extracted from a given distorted signal that either estimate its quality or measure how well it fits the acoustic models of the recognition system. We then apply these measures to the problem of selecting the signal (i.e., the channel) that presumably leads to the lowest recognition error rate. New channel selection techniques are presented and compared experimentally, in reverberant environments, with other approaches reported in the literature. Significant improvements in recognition rate are observed for most of the measures. A new measure based on the variance of the speech intensity envelope shows a good trade-off between recognition accuracy, latency, and computational cost. In addition, combining measures yields a further improvement in recognition rate.
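The abstract does not give the formula for the envelope-variance measure, so the following is only a simplified sketch of the general idea, not the authors' formulation. It assumes that reverberation smears energy into pauses and therefore flattens the intensity envelope, so the channel with the largest envelope variance is taken as the least distorted. The function names, the full-band frame energy, the frame sizes, and the cube-root compression are all assumptions made for illustration.

```python
import numpy as np


def envelope_variance(signal, frame_len=400, hop=160):
    """Variance of a gain-normalized, compressed frame-energy envelope.

    Hypothetical helper: full-band energy is used here for brevity.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")

    # Slice the signal into overlapping frames via an index matrix.
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    energy = np.sum(signal[idx] ** 2, axis=1) + 1e-12  # avoid log(0)

    # Subtracting the mean log-energy removes the channel gain, so a
    # louder microphone is not favored merely for being louder.
    log_energy = np.log(energy)
    log_energy -= log_energy.mean()

    # Cube-root compression of the normalized envelope (assumed choice).
    envelope = np.exp(log_energy / 3.0)
    return float(np.var(envelope))


def select_channel(channels):
    """Return the index of the channel with the largest envelope variance,
    together with the per-channel scores."""
    scores = [envelope_variance(c) for c in channels]
    return int(np.argmax(scores)), scores
```

Called as `best, scores = select_channel([mic0, mic1, mic2])`, where each argument is a one-dimensional array of samples of the same utterance, the sketch needs only one pass over each signal and no decoding, which is in line with the low latency and computational cost mentioned above.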
