A Fast Preprocessing Method for Table Boundary Detection: Narrowing Down the Sparse Lines Using Solely Coordinate Information

With the rapid growth of PDF documents in digital libraries, recognizing document structure and detecting specific document components are useful for document storage, classification, and retrieval. Tables are among the most common of these components. Accurately detecting the table boundary is crucial for later table structure decomposition and table data collection. In this paper, we propose a simple but effective table boundary detection method. Compared with other work in this field, our method has two distinguishing features: 1) Because most tables are text-based, we claim that the text objects of a PDF are sufficient for table detection. In addition, we believe that font information is not as reliable as other works have stated. 2) Based on the nature of table cells, we observe the sparse-line property of table rows. Filtering out the non-sparse lines first reduces the table boundary detection problem to a sparse-line analysis problem. The experimental results not only confirm the importance of coordinate information but also demonstrate the effectiveness of sparse lines in table boundary detection. Combined with additional keywords, our method can also be applied to detect other document components (e.g., mathematical formulas or references).
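The sparse-line filtering idea can be illustrated with a minimal Python sketch that uses only word coordinates. The abstract does not give the exact sparse-line criterion, so everything concrete here is an assumption made for illustration: the `Word` record, the function names, and the two tests (a large horizontal gap between neighboring words, or a line that covers only a small part of the column width), along with the threshold values `max_gap` and `min_fill`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word's bounding box
    x1: float  # right edge of the word's bounding box


def is_sparse_line(words: List[Word], column_width: float,
                   max_gap: float = 20.0, min_fill: float = 0.5) -> bool:
    """Return True if a text line looks sparse (hypothetical criterion).

    Assumed rule: a line is sparse when the largest horizontal gap between
    neighboring words exceeds max_gap, or when the line spans less than
    min_fill of the column width.
    """
    if not words:
        return False
    ordered = sorted(words, key=lambda w: w.x0)
    # Largest horizontal gap between consecutive words (0 for a single word).
    largest_gap = max(
        (right.x0 - left.x1 for left, right in zip(ordered, ordered[1:])),
        default=0.0,
    )
    # Horizontal extent covered by the line, from first to last word.
    span = ordered[-1].x1 - ordered[0].x0
    return largest_gap > max_gap or span < min_fill * column_width


def sparse_line_indices(lines: List[List[Word]], column_width: float) -> List[int]:
    """Indices of the lines kept as table-boundary candidates."""
    return [i for i, line in enumerate(lines)
            if is_sparse_line(line, column_width)]
```

Under these assumptions, a fully justified body-text line is discarded, while a row of widely spaced cells or a short caption line is kept. Sparse lines are only candidates: short non-table lines such as headings also pass this filter, so a later analysis step would still have to separate them from real table rows.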
