An evaluation of posterior modeling techniques for phonetic recognition

Several methods have recently been proposed for modeling posterior representations derived from local classifiers [1, 2]. Sainath et al. proposed a tied-mixture-based posterior modeling approach [3] to enhance exemplar-based posterior representations for phone recognition tasks. In this work, we conduct a detailed evaluation to determine the effectiveness of this technique on three representative posterior systems. In addition, we propose and evaluate an alternative discriminative formulation of the posterior modeling objective function that seeks to minimize frame-level errors. In experimental evaluations on the TIMIT corpus, we find that posterior modeling yields relative reductions in phone error rate (PER) of between 1.1% and 5.5% across the systems tested. Using Spif-NN [4, 3] posteriors, we achieve a PER of 18.5%; to the best of our knowledge, this is the best result reported in the literature to date.

[1] Tara N. Sainath, et al. Exemplar-based Sparse Representation phone identification features, 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] James R. Glass, et al. Heterogeneous acoustic measurements for phonetic classification, 1997.

[3] Hsiao-Wuen Hon, et al. Speaker-independent phone recognition using hidden Markov models, 1989, IEEE Trans. Acoust. Speech Signal Process.

[4] Dimitri Kanevsky, et al. An inequality for rational functions with applications to some statistical estimation problems, 1991, IEEE Trans. Inf. Theory.

[5] Tara N. Sainath, et al. Enhancing Exemplar-Based Posteriors for Speech Recognition Tasks, 2012, INTERSPEECH.

[6] Brian Kingsbury, et al. The IBM Attila speech recognition toolkit, 2010, 2010 IEEE Spoken Language Technology Workshop.

[7] Hervé Bourlard, et al. Continuous speech recognition, 1995, IEEE Signal Process. Mag.

[8] Daniel P. W. Ellis, et al. Tandem connectionist feature extraction for conventional HMM systems, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[9] Eric Fosler-Lussier, et al. Investigations into the Crandem Approach to Word Recognition, 2010, HLT-NAACL.

[10] Sanjeev Khudanpur, et al. Dirichlet Mixture Models of neural net posteriors for HMM-based speech recognition, 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] James R. Glass, et al. Heterogeneous acoustic measurements for phonetic classification, 1997, EUROSPEECH.

[12] Eric Fosler-Lussier, et al. CRANDEM: conditional random fields for word recognition, 2009, INTERSPEECH.

[13] Guillermo Aradilla. Acoustic Models for Posterior Features in Speech Recognition, 2008.

[14] Jerome R. Bellegarda, et al. Tied mixture continuous parameter modeling for speech recognition, 1990, IEEE Trans. Acoust. Speech Signal Process.