Acoustic unit discovery and pronunciation generation from a grapheme-based lexicon

We present a framework for discovering acoustic units and generating an associated pronunciation lexicon from an initial grapheme-based recognition system. Our approach consists of two distinct contributions. First, context-dependent grapheme models are clustered using a spectral clustering approach to create a set of phone-like acoustic units. Next, we transform the pronunciation lexicon using a statistical machine translation-based approach. Pronunciation hypotheses generated from a decoding of the training set are used to create a phrase-based translation table. We propose a novel method for scoring the phrase-based rules that significantly improves the output of the transformation process. Results on an English language dataset demonstrate the combined methods provide a 13% relative reduction in word error rate compared to a baseline grapheme-based system. Our approach could potentially be applied to low-resource languages without existing lexicons, such as in the Babel project.

[1]  Xiuyang Yu,et al.  What kind of pronunciation variation is hard for triphones to model? , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[2]  Tanja Schultz,et al.  Grapheme based speech recognition , 2003, INTERSPEECH.

[3]  Kai Feng,et al.  Approaches to automatic lexicon learning with limited training examples , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Aren Jansen,et al.  Towards Unsupervised Training of Speaker Independent Acoustic Models , 2011, INTERSPEECH.

[5]  Paul Deléglise,et al.  Grapheme to phoneme conversion using an SMT system , 2009, INTERSPEECH.

[6]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[7]  Byung-Jun Yoon,et al.  A Novel Low-Complexity HMM Similarity Measure , 2011, IEEE Signal Processing Letters.

[8]  Steve Young,et al.  The HTK book , 1995 .

[9]  Hermann Ney,et al.  Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10]  Sanjeev Khudanpur,et al.  Unsupervised Learning of Acoustic Sub-word Units , 2008, ACL.

[11]  James R. Glass,et al.  A Nonparametric Bayesian Approach to Acoustic Model Discovery , 2012, ACL.

[12]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[13]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[14]  Martin Kay,et al.  Regular Models of Phonological Rule Systems , 1994, CL.

[15]  W. Marsden I and J , 2012 .

[16]  Mari Ostendorf,et al.  Joint lexicon, acoustic unit inventory and model design , 1999, Speech Commun..

[17]  Lori Lamel,et al.  Pronunciation variants generation using SMT-inspired approaches , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  José Carlos Príncipe,et al.  Closed-form cauchy-schwarz PDF divergence for mixture of Gaussians , 2011, The 2011 International Joint Conference on Neural Networks.

[19]  Hermann Ney,et al.  Joint-sequence models for grapheme-to-phoneme conversion , 2008, Speech Commun..