Joint Modeling of Text and Acoustic-Prosodic Cues for Neural Parsing

In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses. For automatically parsing a spoken utterance, we introduce a model that integrates transcribed text and acoustic-prosodic features using a convolutional neural network over energy and pitch trajectories coupled with an attention-based recurrent neural network that accepts text and word-based prosodic features. We find that different types of acoustic-prosodic features are individually helpful, and together improve parse F1 scores significantly over a strong text-only baseline. For this study with known sentence boundaries, error analysis shows that the main benefit of acoustic-prosodic features is in sentences with disfluencies and that attachment errors are most improved.
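The combination described above, a convolutional network over frame-level energy and pitch trajectories feeding a recurrent encoder alongside text and word-level prosodic features, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than a detail taken from the paper: the function names, the ReLU nonlinearity, max-pooling over time, zero-padding of short words, and simple concatenation as the way the three feature groups are joined. The attention-based recurrent network that would consume the resulting vectors is not shown.

```python
def conv1d_max_pool(frames, filters):
    """Apply each 1-D filter over a word's frames, then max-pool over time.

    frames:  list of T frames, each a list of C values (e.g. energy, pitch).
    filters: list of filters, each a list of W rows with C weights per row.
    Returns one pooled activation per filter (hypothetical design).
    """
    num_channels = len(frames[0])
    pooled = []
    for filt in filters:
        width = len(filt)
        # Zero-pad words shorter than the filter so one window always fits.
        padded = frames + [[0.0] * num_channels] * max(0, width - len(frames))
        activations = []
        for start in range(len(padded) - width + 1):
            total = 0.0
            for offset in range(width):
                frame = padded[start + offset]
                total += sum(w * x for w, x in zip(filt[offset], frame))
            activations.append(max(0.0, total))  # ReLU (assumed nonlinearity)
        pooled.append(max(activations))  # max over time
    return pooled


def word_input_vector(word_embedding, word_prosody, frames, filters):
    """Concatenate text, word-level prosodic, and CNN-derived features.

    The result stands in for one word's input to a recurrent encoder.
    """
    return word_embedding + word_prosody + conv1d_max_pool(frames, filters)


# Toy example: three frames of (energy, pitch) for a single word.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],  # width-2 filter
    [[-1.0, -1.0]],            # width-1 filter, always non-positive here
]
embedding = [0.1, 0.2]   # stand-in word embedding
word_prosody = [0.5]     # stand-in word-level feature, e.g. a pause duration

vector = word_input_vector(embedding, word_prosody, frames, filters)
print(vector)  # [0.1, 0.2, 0.5, 2.0, 0.0]
```

Pooling over time gives each word a fixed-length acoustic summary regardless of how many frames it spans, which is what lets it be concatenated with the per-word text and prosodic features.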

Mari Ostendorf | Trang Tran | Kevin Gimpel | Karen Livescu | Mohit Bansal | Shubham Toshniwal
