Paper information - On the Use of Grapheme Models for Searching in Large Spoken Archives

This paper explores the use of grapheme-based word and sub-word models for spoken term detection (STD). Grapheme models eliminate the need for expert-prepared pronunciation lexicons (which are often far from complete) and/or trainable grapheme-to-phoneme (G2P) algorithms, which are frequently rather inaccurate, especially for rare words (words coming from a different language). Moreover, the G2P conversion of search terms, which needs to be performed on-line, can substantially increase the response time of the STD system. Our results show that various grapheme-based models achieve STD performance, measured in terms of actual term-weighted value (ATWV), comparable with that of phoneme-based models, but without the additional burden of G2P conversion.
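Since the abstract reports results in ATWV, a minimal sketch of how that metric is commonly computed may help. It assumes the standard definition from the NIST 2006 STD evaluation (miss probability plus a false-alarm probability weighted by beta = 999.9, averaged over terms that occur in the reference). The names `TermCounts` and `atwv` are hypothetical, and the paper's exact scoring setup is not given here.

```python
from dataclasses import dataclass

# Weight on false alarms as used in the NIST 2006 STD evaluation (assumed here).
BETA = 999.9


@dataclass
class TermCounts:
    """Per-term detection counts at the system's actual decision threshold."""
    n_true: int      # occurrences of the term in the reference
    n_correct: int   # detections that match a reference occurrence
    n_spurious: int  # detections with no matching reference occurrence


def atwv(counts: dict[str, TermCounts], speech_seconds: float,
         beta: float = BETA) -> float:
    """Return 1 minus the average of (P_miss + beta * P_FA) over scored terms.

    Terms with no reference occurrence are skipped, because their miss
    probability is undefined.
    """
    scored = [c for c in counts.values() if c.n_true > 0]
    if not scored:
        raise ValueError("no term occurs in the reference")
    penalty = 0.0
    for c in scored:
        p_miss = 1.0 - c.n_correct / c.n_true
        # Non-target trials are approximated as one per second of speech.
        p_fa = c.n_spurious / (speech_seconds - c.n_true)
        penalty += p_miss + beta * p_fa
    return 1.0 - penalty / len(scored)
```

Under this definition a perfect system scores 1.0 and a system that returns nothing scores 0.0. Because beta is large, even a few false alarms per hour of speech lower the score noticeably.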
