On using alternative recognition candidates and scores for handwritten documents classification

This paper compares different strategies for representing automatic transcriptions in the context of handwritten document classification. The classical approach trains a statistical classifier directly on the recognizer's output. However, it ignores the specific characteristics of automatic text recognition: the presence of errors, and the availability of confidence scores and recognition alternatives. We propose a method that takes these aspects into account: confidence scores are used as weights in the classifier's input feature vectors, and the n-best recognition alternatives are included in the representation. Using three handwritten document databases and different families of statistical classifiers, we show that this approach consistently improves classification results.
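As an illustration of the idea only, the following minimal Python sketch builds a bag-of-words feature vector in which each term is weighted by the recognizer's confidence and the top n candidates of each word position contribute. The function name `weighted_bag_of_words`, the input layout (a list of word positions, each holding `(candidate, confidence)` pairs sorted by decreasing confidence), and the sample scores are assumptions made for this example; the abstract does not specify the exact weighting scheme, so this should not be read as the authors' implementation.

```python
from collections import defaultdict


def weighted_bag_of_words(word_slots, n_best=3):
    """Build a confidence-weighted bag-of-words vector.

    word_slots: list of word positions; each position is a list of
        (candidate, confidence) pairs sorted by decreasing confidence.
        This layout is a hypothetical one chosen for the example.
    n_best: number of recognition alternatives kept per position.
    """
    features = defaultdict(float)
    for candidates in word_slots:
        # Keep only the n best alternatives for this word position.
        for word, confidence in candidates[:n_best]:
            # Each candidate contributes its confidence instead of a count of 1.
            features[word.lower()] += confidence
    return dict(features)


# Made-up recognizer output for a three-word document.
doc = [
    [("invoice", 0.8), ("invoke", 0.15), ("involve", 0.05)],
    [("payment", 0.6), ("pavement", 0.4)],
    [("invoice", 0.5), ("unvoiced", 0.3)],
]

one_best = weighted_bag_of_words(doc, n_best=1)    # top candidate only
three_best = weighted_bag_of_words(doc, n_best=3)  # alternatives included
```

With `n_best=1` only the top candidate of each position contributes, weighted by its score; with a larger `n_best`, lower-ranked alternatives such as "pavement" also enter the vector with smaller weights, so a correct word that the recognizer ranked second can still influence the classifier.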
