Text Classification and Document Layout Analysis of Paper Fragments

In general, document image analysis methods serve as pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets so that documents can be grouped automatically. To this end, each word is classified as printed text, manuscript, or noise, where the third class absorbs background elements that were falsely segmented as words. Once the text elements are classified, a layout analysis groups words into text lines and paragraphs. The class weights assigned to each word in the first step are then propagated back through these groups, which allows wrong class labels to be corrected. The proposed method shows promising results on a dataset of document snippets with varying shape, content, writing, and layout. In addition, the system is compared to the page segmentation methods of the ICDAR 2009 Page Segmentation Competition.
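The interplay between word classification, line grouping, and label correction can be illustrated with a minimal sketch. It is not the authors' implementation: the `Word` fields, the greedy grouping rule in `group_into_lines`, its `tolerance` parameter, and the summing of weights in `propagate_labels` are all assumptions made for illustration, since the abstract does not specify the features, the classifier, or the grouping criterion.

```python
from dataclasses import dataclass

CLASSES = ("printed", "manuscript", "noise")


@dataclass
class Word:
    x: float        # left edge of the word's bounding box
    y: float        # vertical centre of the bounding box
    height: float   # bounding-box height
    weights: tuple  # per-class weights, in the order of CLASSES (assumed given)


def label(weights):
    """Return the class name with the largest weight."""
    best = max(range(len(CLASSES)), key=lambda i: weights[i])
    return CLASSES[best]


def group_into_lines(words, tolerance=0.5):
    """Greedy grouping (an assumed rule): a word joins the first line whose
    mean vertical centre lies within tolerance * word height."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        for line in lines:
            mean_y = sum(w.y for w in line) / len(line)
            if abs(word.y - mean_y) <= tolerance * word.height:
                line.append(word)
                break
        else:
            lines.append([word])
    return lines


def propagate_labels(lines):
    """Sum the word weights per line and give every word the line's label."""
    result = []
    for line in lines:
        totals = [sum(w.weights[i] for w in line) for i in range(len(CLASSES))]
        line_label = label(totals)
        result.extend((w, line_label) for w in line)
    return result


words = [
    Word(0, 10, 10, (0.8, 0.1, 0.1)),
    Word(30, 10, 10, (0.7, 0.2, 0.1)),
    Word(60, 10, 10, (0.4, 0.5, 0.1)),  # on its own: "manuscript"
    Word(0, 50, 10, (0.1, 0.8, 0.1)),
]
for word, line_label in propagate_labels(group_into_lines(words)):
    print(label(word.weights), "->", line_label)
```

In this toy input, the third word leans slightly towards manuscript on its own, but the accumulated weights of its line favour printed text, so its label is corrected. The same idea could be applied again at paragraph level.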
