Learning to detect tables in document images using line and text information

Table detection is a crucial step in many document analysis applications because tables present essential information to readers in a structured manner. It remains a challenging problem due to the variety of table structures and the complexity of document layouts. This paper presents a hybrid method consisting of three fundamental steps to detect table zones: classification of the regions, detection of tables formed by intersecting horizontal and vertical lines, and identification of tables made up of only parallel lines. Experiments on the UW-III dataset show promising results.
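
To make the distinction between the two line-based table types concrete, the following is a minimal sketch and not the paper's method. It assumes a binarized region given as a list of rows (1 = ink, 0 = background), treats any sufficiently long run of ink pixels as a ruling line, and ignores line thickness, skew, and text information. The names `find_runs`, `classify_region`, and the `min_len` threshold are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]      # binary image: 1 = ink, 0 = background
Run = Tuple[int, int, int]   # (row index, start column, end column), end inclusive


def find_runs(rows: Image, min_len: int) -> List[Run]:
    """Return every run of consecutive 1s of length >= min_len, row by row."""
    runs: List[Run] = []
    for i, row in enumerate(rows):
        start = None
        for j, px in enumerate(row + [0]):  # trailing 0 closes a run at the edge
            if px and start is None:
                start = j
            elif not px and start is not None:
                if j - start >= min_len:
                    runs.append((i, start, j - 1))
                start = None
    return runs


def classify_region(img: Image, min_len: int = 4) -> str:
    """Label a region 'intersecting', 'parallel', or 'none' (illustrative rule)."""
    horizontal = find_runs(img, min_len)
    # Transposing the image turns columns into rows, so the same scan finds
    # vertical lines; each run is then (column index, start row, end row).
    transposed = [list(col) for col in zip(*img)]
    vertical = find_runs(transposed, min_len)

    # A horizontal and a vertical line cross when each spans the other's position.
    for y, x0, x1 in horizontal:
        for x, y0, y1 in vertical:
            if x0 <= x <= x1 and y0 <= y <= y1:
                return "intersecting"

    # Assumption: two or more lines in one direction with no crossing
    # are taken as a candidate table made up of only parallel lines.
    if len(horizontal) >= 2 or len(vertical) >= 2:
        return "parallel"
    return "none"


# A 6x6 frame (ruled box) and a region with three horizontal rules only.
grid = [[1 if r in (0, 5) or c in (0, 5) else 0 for c in range(6)] for r in range(6)]
rules = [[1] * 6 if r % 2 == 0 else [0] * 6 for r in range(5)]
blank = [[0] * 6 for _ in range(5)]
```

In a real pipeline the region classification step would run first and a learned classifier would likely replace the fixed rule at the end; the sketch only shows why the two table types call for separate handling.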
