Normalization of non-standard words

In addition to ordinary words and names, real text contains non-standard "words" (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary "letter-to-sound" rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to "normalize" text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text), with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types: news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques, including n-gram language models, decision trees and weighted finite-state transducers, to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
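The word-error rate mentioned above is conventionally defined as the word-level edit distance (substitutions, deletions and insertions) between a system output and a reference, divided by the number of reference words. The abstract does not spell out the exact scoring procedure used, so the following is only a minimal sketch of that conventional definition; the function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the work itself.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: tokens are whitespace-separated and compared exactly;
    a real evaluation may apply additional tokenization or case folding.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical normalization of "$250": the system output drops "hundred".
reference = "the flat costs two hundred fifty dollars"
hypothesis = "the flat costs two fifty dollars"
wer = word_error_rate(reference, hypothesis)
```

Here the hypothesis differs from the seven-word reference by a single deleted word, so the score is one error over seven reference words. Because the measure is computed over the fully expanded word sequence, it penalizes an NSW expansion in proportion to how many words it gets wrong rather than treating each NSW as simply right or wrong.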
