On the Use of Grapheme Models for Searching in Large Spoken Archives

This paper explores the possibility of using grapheme-based word and sub-word models in the task of spoken term detection (STD). Using grapheme models eliminates the need for expert-prepared pronunciation lexicons (which are often far from complete) and/or trainable grapheme-to-phoneme (G2P) algorithms, which are frequently rather inaccurate, especially for rare words such as those borrowed from another language. Moreover, the G2P conversion of search terms, which has to be performed on-line, can substantially increase the response time of an STD system. Our results show that with various grapheme-based models we can achieve STD performance, measured in terms of the Actual Term-Weighted Value (ATWV), comparable to that of phoneme-based models, but without the additional burden of G2P conversion.
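
Since the results are reported in terms of ATWV, the following minimal sketch shows how that metric is computed under the standard NIST STD 2006 definition: term-weighted value is one minus the average, over scored terms, of the miss probability plus beta times the false-alarm probability, with beta = 999.9 and one non-target trial assumed per second of speech. ATWV is this quantity evaluated at the system's actual decision threshold. The term counts and speech duration in the usage example are hypothetical placeholders, not values from the paper.

# Minimal sketch of the NIST STD 2006 term-weighted value.
# ATWV = TWV computed from the detections the system actually output
# (i.e., at its actual decision threshold).

def atwv(terms, speech_duration_s, beta=999.9):
    """terms: list of dicts with keys
         'n_true'     -- reference occurrences of the term,
         'n_correct'  -- detections matching a reference occurrence,
         'n_spurious' -- detections with no matching reference occurrence.
    speech_duration_s: total duration of the searched speech, in seconds.
    """
    twv_sum = 0.0
    scored = 0
    for t in terms:
        if t['n_true'] == 0:
            continue  # terms absent from the reference are not scored
        p_miss = 1.0 - t['n_correct'] / t['n_true']
        # Non-target trials: one per second of speech, minus true occurrences.
        n_nt = speech_duration_s - t['n_true']
        p_fa = t['n_spurious'] / n_nt
        twv_sum += 1.0 - (p_miss + beta * p_fa)
        scored += 1
    return twv_sum / scored

# Illustrative usage with made-up counts over 10 hours of speech:
terms = [
    {'n_true': 10, 'n_correct': 8, 'n_spurious': 1},
    {'n_true': 3,  'n_correct': 2, 'n_spurious': 0},
]
print(atwv(terms, speech_duration_s=36000.0))

The large beta makes each false alarm far more costly than each miss, which is why calibrated detection scores matter so much in STD evaluations.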
