A machine learning based approach for table detection on the web

Table is a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied.In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.

[1]  Daniel P. Lopresti,et al.  Why table ground-truthing is hard , 2001, Proceedings of Sixth International Conference on Document Analysis and Recognition.

[2]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[3]  Hsin-Hsi Chen,et al.  Mining Tables from Large Scale HTML Texts , 2000, COLING.

[4]  Andrew McCallum,et al.  Automating the Construction of Internet Portals with Machine Learning , 2000, Information Retrieval.

[5]  Jun'ichi Tsujii,et al.  A method to integrate tables of the World Wide Web , 2001 .

[6]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[7]  Emanuele Trucco,et al.  Computer and Robot Vision , 1995 .

[8]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[9]  Vladimir Cherkassky,et al.  The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[10]  Linda G. Shapiro,et al.  Computer and Robot Vision (Volume II) , 2002 .

[11]  Matthew Hurst,et al.  Layout and Language: Challenges for Table Understanding on the Web , 2001 .

[12]  Alistair Moffat,et al.  Information Retrieval Systems for Large Document Collections , 1994, TREC.

[13]  Linda G. Shapiro,et al.  Computer and Robot Vision , 1991 .

[14]  Jianying Hu,et al.  Flexible Web document analysis for delivery to narrow-bandwidth devices , 2001, Proceedings of Sixth International Conference on Document Analysis and Recognition.

[15]  M. F. Porter,et al.  An algorithm for suffix stripping , 1997 .

[16]  Thorsten Joachims,et al.  A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization , 1997, ICML.

[17]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.