TableSeer: automatic table metadata extraction and searching in digital libraries

Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatically extracting tables from untagged documents, the lack of a universal table metadata specification, and the limitations of existing ranking schemes make the table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables in documents, extracts table metadata, indexes and ranks tables, and provides a user-friendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. In addition, we devise a novel page box-cutting method to improve the performance of table detection. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm, TableRank. TableRank rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, TableSeer eliminates the burden of manually extracting table data from digital libraries and enables users to automatically examine tables. We demonstrate the value of TableSeer with empirical studies on scientific documents.
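To make the idea of rating a <query, table> pair with a vector space model concrete, the following is a minimal sketch using generic TF-IDF weighting and cosine similarity. It is not the TableRank algorithm: the abstract does not specify TableRank's tailored model or term weighting scheme, so the weighting formula, the function names (`tfidf_vectors`, `cosine`, `rank_tables`), and the representation of each table as a single string of metadata text are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of text documents.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is a common
    textbook choice and NOT the weighting scheme described in the paper.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({t: f * idf[t] for t, f in term_freq.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_tables(query, tables):
    """Score every table against the query; return (score, index) pairs, best first.

    `tables` is a list of strings, each standing in for one table's
    metadata text (a hypothetical simplification).
    """
    vectors, idf = tfidf_vectors(tables)
    query_tf = Counter(tokenize(query))
    # Query terms that appear in no table carry no weight.
    query_vec = {t: f * idf[t] for t, f in query_tf.items() if t in idf}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    tables = [
        "melting point of alloys by composition",
        "gene expression levels in yeast",
        "boiling point and melting point of solvents",
    ]
    for score, index in rank_tables("gene expression", tables):
        print(f"{score:.3f}  {tables[index]}")
```

In this sketch each table is treated as a flat bag of words. A table-specific scheme would presumably weight terms differently depending on which part of the table metadata they occur in, but that detail is not given in the abstract.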
