Speech recognitionwith segmental conditional random fields: A summary of the JHU CLSP 2010 Summer Workshop

This paper summarizes the 2010 CLSP Summer Workshop on speech recognition at Johns Hopkins University. The key theme of the workshop was to improve on state-of-the-art speech recognition systems by using Segmental Conditional Random Fields (SCRFs) to integrate multiple types of information. This approach uses a state-of-the-art baseline as a springboard from which to add a suite of novel features including ones derived from acoustic templates, deep neural net phoneme detections, duration models, modulation features, and whole word point-process models. The SCRF framework is able to appropriately weight these different information sources to produce significant gains on both the Broadcast News and Wall Street Journal tasks.

[1]  Hynek Hermansky,et al.  Phoneme recognition using spectral envelope and modulation frequency features , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[2]  Hynek Hermansky,et al.  Sparse auto-associative neural networks: theory and application to speech recognition , 2010, INTERSPEECH.

[3]  Patrick Wambacq,et al.  SPRAAK: an open source "SPeech recognition and automatic annotation kit" , 2008, INTERSPEECH.

[4]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[5]  Patrick Wambacq,et al.  Template-Based Continuous Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[7]  Mark J. F. Gales,et al.  Augmented Statistical Models for Speech Recognition , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[8]  Malcolm Slaney,et al.  Solving Demodulation as an Optimization Problem , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Geoffrey Zweig,et al.  Advances in speech transcription at IBM under the DARPA EARS program , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Aren Jansen,et al.  Point Process Models for Spotting Keywords in Continuous Speech , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Geoffrey E. Hinton,et al.  Deep Belief Networks for phone recognition , 2009 .

[12]  Dirk Van Compernolle,et al.  HEAR: an hybrid episodic-abstract speech recognizer , 2009, INTERSPEECH.

[13]  Les E. Atlas,et al.  Time-Frequency Coherent Modulation Filtering of Nonstationary Signals , 2009, IEEE Transactions on Signal Processing.

[14]  Geoffrey Zweig,et al.  SCARF: a segmental conditional random field toolkit for speech recognition , 2010, INTERSPEECH.

[15]  Geoffrey Zweig,et al.  A segmental CRF approach to large vocabulary continuous speech recognition , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.