A survey of types of text noise and techniques to handle noisy text

Noise is ubiquitous in real-world text communication. Text produced by processing signals intended for human use is often too noisy for automated computer processing. Automatic speech recognition, optical character recognition, and machine translation all introduce processing noise. Digital text produced in informal settings, such as online chat, SMS, email, message boards, newsgroups, blogs, wikis, and web pages, also contains considerable noise. In this paper, we present a survey of the existing measures for noise in text. We also cover application areas that ingest this noisy text for tasks such as Information Retrieval and Information Extraction.
