Joint Modeling of Text and Acoustic-Prosodic Cues for Neural Parsing

In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses. To parse spoken utterances automatically, we introduce a model that integrates transcribed text with acoustic-prosodic features: a convolutional neural network over energy and pitch trajectories is coupled with an attention-based recurrent neural network that accepts text and word-based prosodic features. We find that different types of acoustic-prosodic features are individually helpful and that together they significantly improve parse F1 scores over a strong text-only baseline. In this study, which assumes known sentence boundaries, error analysis shows that the main benefit of acoustic-prosodic features is in sentences with disfluencies and that attachment errors are most improved.
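
As a rough illustration of how frame-level energy and pitch trajectories could be turned into a per-word vector and combined with text and word-based prosodic features, the following is a minimal NumPy sketch. It is not the authors' implementation. The function names, the ReLU nonlinearity, max-over-time pooling, zero padding, the filter widths, all dimensions, and the choice to concatenate the features are assumptions made for illustration only; the attention-based recurrent network that would consume these vectors is omitted.

```python
import numpy as np

def conv1d_maxpool(frames, filters):
    """Convolve frame-level features for one word and pool over time.

    frames:  (T, C) array, e.g. C=2 for energy and pitch (assumed layout).
    filters: (K, W, C) array of K filters, each spanning W frames.
    Returns a (K,) vector of max-over-time activations.
    """
    T, C = frames.shape
    K, W, _ = filters.shape
    if T < W:
        # Assumption: zero-pad words shorter than the filter width.
        frames = np.vstack([frames, np.zeros((W - T, C))])
        T = W
    # Stack all length-W windows: shape (T - W + 1, W, C).
    windows = np.stack([frames[t:t + W] for t in range(T - W + 1)])
    # Dot each window with each filter: shape (T - W + 1, K).
    acts = np.tensordot(windows, filters, axes=([1, 2], [1, 2]))
    acts = np.maximum(acts, 0.0)  # ReLU (assumed nonlinearity)
    return acts.max(axis=0)       # max pooling over time

def word_input_vector(embedding, word_prosody, frames, filter_bank):
    """Concatenate text, word-level prosody, and pooled CNN features."""
    pooled = [conv1d_maxpool(frames, f) for f in filter_bank]
    return np.concatenate([embedding, word_prosody] + pooled)

# Hypothetical toy inputs for a single word.
rng = np.random.default_rng(0)
embedding = rng.normal(size=8)           # word embedding (size assumed)
word_prosody = np.array([0.12, 0.35])    # e.g. pause and duration features
frames = rng.normal(size=(20, 2))        # 20 frames of (energy, pitch)
filter_bank = [rng.normal(size=(4, w, 2)) for w in (3, 5)]

x = word_input_vector(embedding, word_prosody, frames, filter_bank)
print(x.shape)  # (18,)
```

In a full model along the lines described above, one such vector per word would presumably form the input sequence to the attention-based recurrent encoder, with the convolution filters trained jointly with the parser rather than drawn at random.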
