Recognize, Categorize, and Retrieve

A successful text categorization experiment divides a textual collection into pre-defined classes. A true representative for each class is generally obtained during training of the categorizer. In this paper, we report on our experiments on training and categorization of optically recognized documents. In particular, we address the effects that OCR errors may have on training, dimensionality reduction, and categorization. We further report on ways in which categorization may aid error correction and improve retrieval effectiveness.
