Semantic and phonetic automatic reconstruction of medical dictations

Automatic speech recognition (ASR) has become a valuable tool in large-scale document production environments such as medical dictation. While manual post-processing is still needed to correct speech-recognition errors and to produce documents that adhere to various stylistic and formatting conventions, a large part of the document production process is carried out by the ASR system. Improving the quality of the system output requires knowledge about the multi-layered relationship between the dictated texts and the final documents: with such knowledge, typical speech-recognition errors can be avoided, and proper style and formatting can be anticipated in the ASR part of the document production process. Yet, while vast amounts of recognition results and manually edited final reports are constantly being produced, error-free literal transcripts of the actually dictated texts remain a scarce and costly resource, because they have to be created by manually transcribing the audio recordings. To obtain large corpora of literal transcripts for medical dictation, we propose a method for automatically reconstructing them from draft speech-recognition transcripts and the corresponding final medical reports. The main innovative aspect of our method is the combination of two independent knowledge sources: phonetic information for identifying speech-recognition errors and semantic information for detecting post-editing concerning format and style. Speech-recognition results and final reports are first aligned, then matched based on semantic and phonetic similarity, and finally categorised and selectively combined into a reconstruction hypothesis. The method can serve various applications in language technology, e.g., ASR adaptation, document production, or, more generally, the development of parallel text corpora from non-literal text resources. In an experimental evaluation, which also includes an assessment of the quality of the reconstructed transcripts compared to manual transcriptions, the method yields a relative word-error-rate reduction of 7.74% after the standard language model is retrained with reconstructed transcripts.
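
To make the align-match-combine step more concrete, the following minimal Python sketch illustrates the selective-combination idea under simplifying assumptions: the ASR draft and the final report are aligned with a generic word-level sequence matcher, and "phonetic" similarity is approximated by a normalised letter-level Levenshtein distance (a real system would compare phoneme sequences obtained from a pronunciation lexicon, and would also use semantic similarity). All function names, the threshold, and the toy data are illustrative and not part of the published method.

    from difflib import SequenceMatcher

    def levenshtein(a: str, b: str) -> int:
        """Plain dynamic-programming string edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def phonetic_similarity(a: str, b: str) -> float:
        """Similarity in [0, 1]; letters stand in for phoneme sequences here."""
        a, b = a.lower(), b.lower()
        if not a and not b:
            return 1.0
        return 1.0 - levenshtein(a, b) / max(len(a), len(b))

    def reconstruct(asr_words, report_words, threshold=0.5):
        """Combine ASR draft and final report into a reconstruction hypothesis.

        Matching segments are copied as-is; for mismatching segments,
        phonetic similarity decides whether the report wording reflects a
        corrected recognition error (take the report) or a stylistic
        rewrite by the transcriptionist (keep the ASR draft, which is
        then closer to the literal dictation).
        """
        hypothesis = []
        matcher = SequenceMatcher(None, asr_words, report_words)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            asr_seg, rep_seg = asr_words[i1:i2], report_words[j1:j2]
            if op == "equal":
                hypothesis.extend(asr_seg)       # both sources agree
            elif phonetic_similarity(" ".join(asr_seg),
                                     " ".join(rep_seg)) >= threshold:
                hypothesis.extend(rep_seg)       # likely corrected ASR error
            else:
                hypothesis.extend(asr_seg)       # likely format/style post-edit
        return hypothesis

    asr_draft = "the patient has a history of die beaties".split()
    final_report = "the patient has a history of diabetes".split()
    print(" ".join(reconstruct(asr_draft, final_report)))
    # -> the patient has a history of diabetes

In this toy example, the misrecognised "die beaties" is phonetically close to the report's "diabetes", so the report wording is adopted into the reconstruction; a phonetically dissimilar mismatch (e.g. a heading inserted by the transcriptionist) would instead fall back on the ASR wording as the better approximation of what was actually dictated.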
