An investigation of linguistic information for speech recognition error detection

After several decades of effort, significant progress has been made in speech recognition technologies, and various speech-based applications have been developed. However, current speech recognition systems still generate erroneous output, which hinders the wide adoption of speech applications. Given that the goal of error-free output cannot be realized in the near future, mechanisms for automatically detecting and even correcting speech recognition errors may prove useful for compensating for imperfect speech recognition systems. This dissertation research focuses on the automatic detection of speech recognition errors for monologue applications, and in particular, dictation applications. Due to computational complexity and efficiency concerns, only limited linguistic information is embedded in speech recognition systems. Humans, by contrast, apply linguistic knowledge when identifying speech recognition errors. This dissertation therefore investigates the effect of linguistic information on automatic error detection by applying two levels of linguistic analysis, specifically syntactic analysis and semantic analysis, to the post-processing of speech recognition output. Experiments are conducted on two dictation corpora that differ in both topic and style (daily office communication by students and Wall Street Journal news by journalists). To catch grammatical abnormalities possibly caused by speech recognition errors, two sets of syntactic features, linkage information and word associations based on syntactic dependency, are extracted for each word from the output of two lexicalized robust syntactic parsers, respectively. Confidence measures, which combine features using Support Vector Machines, are used to detect speech recognition errors.
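As a rough illustration of a confidence measure that combines per-word features with a Support Vector Machine, the sketch below trains a linear SVM by subgradient descent on the hinge loss. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the three features, the toy training data, the linear kernel, and the names `TRAIN`, `score`, and `train_linear_svm` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label)

# Hypothetical per-word features (assumed for illustration only):
# (recognizer score, parser link indicator, syntactic word-association score).
# Label +1 marks a correctly recognized word, -1 a recognition error.
TRAIN: List[Sample] = [
    ((0.9, 1.0, 0.8), 1),
    ((0.8, 1.0, 0.6), 1),
    ((0.7, 1.0, 0.7), 1),
    ((0.4, 0.0, 0.1), -1),
    ((0.5, 0.0, 0.2), -1),
    ((0.6, 0.0, 0.1), -1),
]


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed decision value; larger means more confidence the word is correct."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(samples: List[Sample], epochs: int = 500,
                     lr: float = 0.05, lam: float = 0.001):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            violated = y * score(w, b, x) < 1.0  # inside the margin or misclassified
            # The regularization term shrinks the weights on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if violated:
                # Hinge-loss subgradient step toward satisfying the margin.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


W, B = train_linear_svm(TRAIN)
for features in [(0.85, 1.0, 0.7), (0.45, 0.0, 0.15)]:
    label = "correct" if score(W, B, features) > 0 else "error"
    print(features, round(score(W, B, features), 3), label)
```

The signed decision value serves as the confidence score here; a word is flagged as a likely error when the score falls below zero, and that cut-off could be moved to trade precision against recall.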
A confidence measure that combines syntactic features with non-linguistic features yields a consistent performance improvement in one or more aspects over that obtained using non-linguistic features alone. Semantic abnormalities possibly caused by speech recognition errors are caught by analyzing the semantic relatedness of a word to its context. Two different methods are used to integrate semantic analysis with syntactic analysis. One approach extracts features for each word from its relations to other words. To this end, various WordNet-based measures and different context lengths are examined. Adding semantic features to the confidence measures yields a further small but consistent improvement in error detection performance. The other approach applies lexical cohesion analysis, taking both reiteration and collocation relationships into consideration and augmenting words with probabilities predicted from syntactic analysis. Two WordNet-based measures and one measure based on Latent Semantic Analysis are used to instantiate lexical cohesion relationships. Additionally, various word probability thresholds and cosine similarity thresholds are examined. Incorporating lexical cohesion analysis is superior to using syntactic analysis alone. In summary, the use of linguistic information as described, including syntactic and semantic information, has a positive impact on the automatic detection of speech recognition errors.
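The idea of scoring a word by its semantic relatedness to its context, and applying a cosine similarity threshold, can be sketched as follows. This is an illustrative sketch only: the two-dimensional word vectors are invented stand-ins for vectors that Latent Semantic Analysis might produce, and the mean-over-window aggregation, the window size, the threshold of 0.5, and all function names are assumptions rather than details taken from the dissertation.

```python
import math
from typing import Dict, List, Sequence

# Toy vectors invented for illustration; real vectors would come from a
# model such as Latent Semantic Analysis trained on a corpus.
VECTORS: Dict[str, Sequence[float]] = {
    "stock": (1.0, 0.1),
    "market": (0.9, 0.2),
    "shares": (0.8, 0.1),
    "beach": (0.1, 1.0),
}

WORDS: List[str] = ["stock", "market", "beach", "shares"]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def context_relatedness(words: List[str], index: int,
                        vectors: Dict[str, Sequence[float]],
                        window: int = 3) -> float:
    """Mean cosine similarity between words[index] and its in-window neighbors."""
    target = vectors.get(words[index])
    if target is None:
        return 0.0
    lo = max(0, index - window)
    hi = min(len(words), index + window + 1)
    sims = [cosine(target, vectors[words[j]])
            for j in range(lo, hi)
            if j != index and words[j] in vectors]
    return sum(sims) / len(sims) if sims else 0.0


def flag_unrelated(words: List[str], vectors: Dict[str, Sequence[float]],
                   threshold: float = 0.5, window: int = 3) -> List[str]:
    """Return words whose relatedness to their context falls below the threshold."""
    return [w for i, w in enumerate(words)
            if w in vectors
            and context_relatedness(words, i, vectors, window) < threshold]


print(flag_unrelated(WORDS, VECTORS))
```

Varying `window` and `threshold` corresponds loosely to examining different context lengths and cosine similarity thresholds; a WordNet-based measure could replace `cosine` without changing the rest of the sketch.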
