Predicting Semantic Labels of Text Regions in Heterogeneous Document Images

This paper applies sequence labeling methods to predict the semantic labels of text regions extracted from heterogeneous electronic documents, using features associated with each semantic label. We construct a novel dataset of real-world documents from multiple domains, evaluate the methods on it, and offer a novel investigation into how textual features influence performance across domains. The experiments show that the Conditional Random Field method is robust and outperforms the neural network when limited training data is available. Regarding generalizability, our experiments show that including textual features does not guarantee performance improvements.
