Lessons Learned in Part-of-Speech Tagging of Conversational Speech

This paper examines tagging models for spontaneous English speech transcripts. We analyze the performance of state-of-the-art tagging models (generative or discriminative, left-to-right or bidirectional, with or without latent annotations), together with the use of ToBI break indexes and several methods for segmenting the speech transcripts (i.e., by conversation side, speaker turn, or human-annotated sentence). Based on these studies, we observe that: (1) bidirectional models tend to achieve higher accuracy than left-to-right models, (2) generative models seem to perform somewhat better than discriminative models on this task, and (3) prosody improves the tagging performance of models on conversation sides, but has much less impact on smaller segments. We conclude that, although break indexes can significantly improve performance over baseline models on conversation sides, tagging accuracy improves more when smaller segments are used, for which the impact of the break indexes is marginal.
