Categorization of On-Line Handwritten Documents

With the growth of on-line handwriting technologies, facilities for managing handwritten documents, such as retrieval of documents by topic, are required. These documents can contain graphics, equations, or text, for instance. This work reports experiments on the categorization of on-line handwritten documents based on their textual content. We assume that handwritten text blocks have been extracted from the documents and, as a first step of the proposed system, we process them with an existing handwriting recognition engine. We analyse the effect of the word recognition rate on categorization performance, and we compare it with the performance obtained on the same texts available as ground truth. Two categorization algorithms (kNN and SVM) are compared in this work. The handwritten texts are a subset of the Reuters-21578 corpus, collected from more than 1500 writers. The results show no significant loss in categorization performance when the word error rate stays below 22%.
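To make the two quantities discussed above concrete, the following minimal sketch computes a word error rate between a ground-truth text and a recognizer output, and categorizes both with a small kNN classifier. It is an illustration under stated assumptions, not the system used in this work: the function names (`word_error_rate`, `bag_of_words`, `cosine`, `knn_predict`) are hypothetical, the toy sentences are invented and do not come from the corpus, and the kNN variant shown (raw term counts with cosine similarity and majority vote) is only one plausible configuration.

```python
import math
from collections import Counter


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def bag_of_words(text):
    """Raw term counts; an assumed, simplified text representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, text, k=3):
    """Majority vote among the k training texts most similar to `text`."""
    query = bag_of_words(text)
    ranked = sorted(train,
                    key=lambda item: cosine(bag_of_words(item[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy data (assumption): labelled training texts and one test text
# available both as ground truth and as a simulated recognizer output.
train = [
    ("oil prices rose as crude output fell", "crude"),
    ("crude oil exports from the gulf increased", "crude"),
    ("wheat harvest and grain exports declined", "grain"),
    ("grain shipments of wheat and corn rose", "grain"),
]
truth = "crude oil prices fell sharply"
recognized = "crude oil prices tell sharply"  # one substituted word

wer = word_error_rate(truth, recognized)
label_truth = knn_predict(train, truth)
label_recognized = knn_predict(train, recognized)
```

In this toy case one word out of five is misrecognized, yet enough topical terms survive for both versions of the text to receive the same label, which is the kind of robustness the experiments measure at corpus scale.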
