Speech Processing for Audio Indexing

This paper addresses some of the recent trends in speech processing, with a focus on speech-to-text transcription as a means to facilitate access to multimedia information in a multilingual context. A brief overview of automatic speech recognition is given, along with indicative performance measures for a range of tasks. Enriched transcription, that is, enhancing the automatic word transcripts with metadata derived from the audio data, is then discussed, followed by some highlights of recent progress and remaining challenges in speech recognition.
