Document centered approach to text normalization

In this paper we present an approach to three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected (for example, at the start of a sentence), and identification of abbreviations. The main feature of our approach is that it uses a minimum of pre-built resources and instead infers disambiguation clues dynamically from the entire document. This makes it domain-independent, closely targeted to each individual document, and portable to other languages. We evaluated the approach thoroughly on several corpora, where it showed high accuracy.
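To make the document-centered idea concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a deliberately simple tokenizer and treats every token that follows `.`, `!`, or `?` (or opens the document) as an ambiguous position. The names `collect_evidence` and `classify_sentence_initial` are hypothetical and introduced only for illustration. The sketch gathers evidence from unambiguous mid-sentence occurrences in the same document and reuses it to decide whether a capitalized word in an ambiguous position is more likely a proper name or a common word.

```python
import re
from collections import Counter

# Simplifying assumption: words are runs of ASCII letters, and only
# '.', '!' and '?' can end a sentence.
TOKEN = re.compile(r"[A-Za-z]+|[.!?]")


def collect_evidence(text):
    """Count how each word is cased in unambiguous (mid-sentence) positions."""
    lower, upper = Counter(), Counter()
    prev = None
    for tok in TOKEN.findall(text):
        # A position is ambiguous if it opens the document or follows
        # sentence-final punctuation: capitalization is expected there.
        ambiguous = prev is None or prev in ".!?"
        if tok.isalpha() and not ambiguous:
            if tok[0].isupper():
                upper[tok.lower()] += 1
            else:
                lower[tok.lower()] += 1
        prev = tok
    return lower, upper


def classify_sentence_initial(word, lower, upper):
    """Guess the status of a capitalized word in an ambiguous position."""
    key = word.lower()
    if upper[key] > lower[key]:
        return "proper"   # seen capitalized mid-sentence in this document
    if lower[key] > upper[key]:
        return "common"   # seen lowercased mid-sentence in this document
    return "unknown"      # no document-internal evidence either way


text = (
    "Apple announced a phone. Yesterday the phone sold well. "
    "Apple shares rose. The analysts said Apple will grow. "
    "Phone sales matter."
)
lower, upper = collect_evidence(text)
for word in ("Apple", "Phone", "Yesterday"):
    print(word, classify_sentence_initial(word, lower, upper))
```

Because the sketch treats every period as a possible sentence end, a word following an abbreviation such as "Mr." contributes no evidence. This is a conservative choice made for brevity; a fuller treatment would resolve abbreviations and sentence boundaries together, as the abstract describes.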
