How Much Noise Is Too Much: A Study in Automatic Text Classification

Noise is a stark reality in real-life data. It has a significant impact in text analytics in particular, where data cleaning forms a very large part of the data processing cycle. Noisy unstructured text is common in informal settings such as online chat, SMS, email, newsgroups, and blogs, as well as in text automatically transcribed from speech and text automatically recognized from printed or handwritten material. Gigabytes of such data are generated every day on the Internet, in contact centers, and on mobile phones. Researchers have looked at various text mining issues for noisy text, such as pre-processing and cleaning, information extraction, rule learning, and classification. This paper focuses on the issues faced by automatic text classifiers in analyzing noisy documents from various sources. Its goal is to bring out and study the effect of different kinds of noise on automatic text classification. Does the nature of such text warrant moving beyond traditional text classification techniques? We present detailed experimental results with simulated noise on the Reuters-21578 and 20-newsgroups benchmark datasets, along with results on real-life noisy datasets from various CRM domains.
