A comparison of grapheme and phoneme-based units for Spanish spoken term detection

The ever-increasing volume of audio data available online through the World Wide Web means that automatic methods for indexing and search are becoming essential. Hidden Markov model (HMM) keyword spotting and lattice search are the two approaches most commonly used by such systems. In keyword spotting, models or templates are defined for each search term before the speech is accessed and are then used to find matches. Lattice search (referred to as spoken term detection) uses a pre-indexing of the speech data in terms of word or sub-word units, which can then be searched quickly for arbitrary terms without referring to the original audio. In both cases, the search term can be modelled in terms of sub-word units, typically phonemes. For in-vocabulary words (i.e., words that appear in the pronunciation dictionary), letter-to-sound conversion systems are accepted to work well. However, for out-of-vocabulary (OOV) search terms, letter-to-sound conversion must be used to generate a pronunciation for the search term. This is usually a hard decision (i.e., not probabilistic and with no possibility of backtracking), and errors introduced at this step are difficult to recover from. We therefore propose the direct use of graphemes (i.e., letter-based sub-word units) for acoustic modelling. This is expected to work particularly well in languages such as Spanish, where the letter-to-sound mapping is very regular but the correspondence is not one-to-one, so there are benefits from avoiding hard decisions at early stages of processing. In this article, we compare three approaches for Spanish keyword spotting or spoken term detection, and within each of these we compare acoustic modelling based on phone and grapheme units. Experiments were performed using the Spanish geographical-domain Albayzin corpus.
Results achieved in the two approaches proposed for spoken term detection show that trigrapheme units for acoustic modelling match or exceed the performance of phone-based acoustic models. In the method proposed for keyword spotting, the results achieved with the two types of acoustic model are very similar.
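
To make the notion of grapheme and trigrapheme units concrete, the following minimal sketch shows how a search term could be expanded into letter-based units and then into context-dependent units without any letter-to-sound step. It is an illustration under stated assumptions, not the system described here: the function names are hypothetical, each letter (including accented vowels) is assumed to be its own unit, and the left-centre+right label format is borrowed from the common triphone naming convention.

```python
def to_graphemes(term):
    """Split a search term into letter-based units.

    Assumption: one unit per alphabetic character, lower-cased;
    accented vowels are kept as distinct units.
    """
    return [ch for ch in term.lower() if ch.isalpha()]


def to_trigraphemes(units):
    """Attach left and right context to each unit.

    Assumption: labels follow the triphone-style pattern
    left-centre+right, with missing context omitted at word edges.
    """
    labels = []
    for i, centre in enumerate(units):
        left = units[i - 1] + "-" if i > 0 else ""
        right = "+" + units[i + 1] if i < len(units) - 1 else ""
        labels.append(left + centre + right)
    return labels


if __name__ == "__main__":
    units = to_graphemes("Vigo")
    print(units)
    print(to_trigraphemes(units))
```

Because the expansion is a direct function of the spelling, an out-of-vocabulary term is handled in the same way as any other term, and no hard pronunciation decision is taken before acoustic matching.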
