Impact of online handwriting recognition performance on text categorization

Demand is growing for efficient archival and retrieval methods for online handwritten data, and text categorization is of particular interest for such tasks. The textual content of online documents can be extracted through online handwriting recognition; however, this process introduces errors into the resulting text. This work reports experiments on the categorization of online handwritten documents based on their textual content. We analyze the effect of word recognition errors on categorization performance by comparing the performance of a categorization system on texts obtained through online handwriting recognition with its performance on the same texts available as ground truth. Two well-known categorization algorithms (kNN and SVM) are compared. For this study, a subset of the Reuters-21578 corpus consisting of more than 2,000 handwritten documents was collected. Results show that the loss in classification rate is not significant, and that the loss in precision is significant only for recall values of 60–80%, depending on the noise level.
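The comparison described above (one categorizer evaluated on ground-truth text and on error-prone recognized text) can be illustrated with a minimal sketch. This is not the system used in the study: the toy documents, the category labels, the function names, and the noise model (replacing a fraction of words with a junk token) are all assumptions made for illustration, and a simple cosine-similarity kNN stands in for the actual kNN and SVM classifiers.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus; the real study uses a Reuters-21578 subset.
TRAIN = [
    ("wheat corn harvest grain", "grain"),
    ("grain export wheat tonnes", "grain"),
    ("corn crop grain yield", "grain"),
    ("oil crude barrel opec", "crude"),
    ("crude oil prices barrel", "crude"),
    ("opec oil output crude", "crude"),
]
TEST = [
    ("wheat grain harvest", "grain"),
    ("crude oil barrel", "crude"),
]


def bag_of_words(text):
    """Map a text to a term-frequency vector (word -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Label a text by majority vote among its k most similar training texts."""
    query = bag_of_words(text)
    ranked = sorted(
        train,
        key=lambda item: cosine(bag_of_words(item[0]), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def corrupt(text, error_rate, rng):
    """Assumed noise model: replace each word with a junk token
    with probability error_rate, imitating word recognition errors."""
    return " ".join(
        "<err>" if rng.random() < error_rate else word
        for word in text.split()
    )


def accuracy(train, test):
    """Fraction of test texts assigned their true category."""
    hits = sum(knn_predict(train, text) == label for text, label in test)
    return hits / len(test)


if __name__ == "__main__":
    rng = random.Random(0)
    print("ground truth:", accuracy(TRAIN, TEST))
    for rate in (0.2, 0.5):
        noisy = [(corrupt(text, rate, rng), label) for text, label in TEST]
        print(f"simulated word error rate {rate}:", accuracy(TRAIN, noisy))
```

Running the same evaluation at several simulated error rates mirrors the idea of measuring how much categorization performance degrades as recognition noise increases; with real data, the noisy texts would come from the recognizer output rather than from `corrupt`.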
