Effective Use of Cross-Domain Parsing in Automatic Speech Recognition and Error Detection

Marius Alexandru Marin

Chair of the Supervisory Committee: Professor Mari Ostendorf, Electrical Engineering

Automatic speech recognition (ASR), the transcription of human speech into text form, is used in many settings in our society, ranging from customer service applications to personal assistants on mobile devices. In all such settings it is important for the system to know when it is making errors, so that it may ask the user to rephrase or restate their previous utterance. Such errors are often syntactically anomalous. The primary goal of this thesis is to find novel uses of parsing for automatic detection and correction of ASR errors.

We start by developing a framework for ASR rescoring and automatic error detection leveraging syntactic parsing in conjunction with a maximum entropy classifier, and find that parsing helps with error detection, even when the parser is trained on out-of-domain data. In particular, features capturing parser reliability are used to improve the detection of out-of-vocabulary (OOV) and name errors. However, parsers trained on out-of-domain treebanks do not provide any benefit to ASR rescoring. This observation motivates our work on domain adaptation of parsing, with the objective of directly improving both transcription accuracy and error detection.

We develop two weakly supervised domain adaptation methods which use error labels, but no hand-annotated parses: a self-training approach to directly improve the probabilistic context-free grammar (PCFG) model used in parsing, as well as a novel model combination method using a discriminative log-linear model to augment the generative PCFG. We apply both methods to ASR rescoring and error detection tasks. We find that self-training improves the ability of our parser to select the correct ASR hypothesis. The log-linear adaptation improves both OOV and name error detection, and self-training performed after log-linear adaptation further improves the reliability of the parser, while producing smaller, faster models.

Finally, motivated by empirical observations that the presence of names in an utterance is often indicated by words located far apart from the names themselves, we develop a general long-distance phrase pattern learning algorithm using word-level semantic similarity measures, and apply it to the problem of name error detection. This novel feature learning method leads to more robust classification, both when used independently of parsing, and in conjunction with parse features.
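The idea of augmenting a generative PCFG score with a discriminative log-linear model for ASR rescoring can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Hypothesis` fields, the weights `w_asr` and `w_pcfg`, and the `parse_failed` feature are all hypothetical names chosen for illustration, and the scores are made-up numbers. The sketch only shows the general shape of a log-linear combination over an N-best list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Hypothesis:
    """One entry in an ASR N-best list (all fields are illustrative assumptions)."""
    words: str
    asr_logprob: float   # recognizer score (acoustic + language model)
    pcfg_logprob: float  # log-probability of the best parse under a PCFG
    features: Dict[str, float] = field(default_factory=dict)  # extra discriminative features


def combined_score(h: Hypothesis, w_asr: float, w_pcfg: float,
                   lambdas: Dict[str, float]) -> float:
    """Log-linear combination: weighted sum of log scores and feature values."""
    score = w_asr * h.asr_logprob + w_pcfg * h.pcfg_logprob
    for name, value in h.features.items():
        # Features without a learned weight contribute nothing.
        score += lambdas.get(name, 0.0) * value
    return score


def rescore(nbest: List[Hypothesis], w_asr: float = 1.0, w_pcfg: float = 0.5,
            lambdas: Optional[Dict[str, float]] = None) -> Hypothesis:
    """Return the hypothesis with the highest combined score."""
    weights = lambdas or {}
    return max(nbest, key=lambda h: combined_score(h, w_asr, w_pcfg, weights))


# Toy N-best list with invented scores.
nbest = [
    Hypothesis("call john smith", -10.0, -8.0, {"parse_failed": 0.0}),
    Hypothesis("call john's myth", -9.5, -12.0, {"parse_failed": 1.0}),
]

with_parser = rescore(nbest, lambdas={"parse_failed": -2.0})
asr_only = rescore(nbest, w_pcfg=0.0)
print(with_parser.words)
print(asr_only.words)
```

In this toy setup the recognizer alone slightly prefers the second hypothesis, while the parser score and the hypothetical parse-reliability feature shift the choice to the first. In practice the weights would be estimated from data with error labels rather than set by hand.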
