Searching for Tables in Digital Documents

Tables are ubiquitous. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. Current search engines do not allow the end-user to search for relevant tables. In this paper, we describe TableSeer, an automatic table extraction and search engine system. TableSeer crawls scientific documents, identifies documents with tables, extracts tables from documents, indexes them and enables end-users to search for tables. We also propose an extensive set of medium-independent metadata for tables representation. Given a query, TableSeer ranks the returned results using an innovative ranking algorithm - TableRank. Our results show that TableSeer outperforms popular search engines, such as Google Scholar when the end-user seeks for tables.

[1]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[2]  W. Bruce Croft,et al.  TINTIN: a system for retrieval in text tables , 1997, DL '97.

[3]  Kun Bai,et al.  Automatic extraction of table metadata from digital documents , 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[4]  Yalin Wang,et al.  A machine learning based approach for table detection on the web , 2002, WWW '02.

[5]  Yalin Wang,et al.  Table structure understanding and its performance evaluation , 2004, Pattern Recognit..

[6]  Yalin Wang,et al.  Automatic table ground truth generation and a background-analysis-based table structure extraction method , 2001, Proceedings of Sixth International Conference on Document Analysis and Recognition.

[7]  Xinxin Wang,et al.  Tabular Abstraction, Editing, and Formatting , 1996 .

[8]  Naoki Asada,et al.  Graph grammar based analysis system of complex table form document , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..