A web table extraction algorithm based on tree edit distance

Web tables are widespread, appearing on online shopping sites, supply-and-demand information pages, and search result pages. Extracting structured data from Web tables is therefore an important problem. However, semi-structured Web tables are inconvenient to use directly in Web application systems such as user recommendation and supply-and-demand analysis systems. Web pages can be parsed into tree structures, and Web table information in the parse tree exhibits a clear hierarchical structure. Moreover, the sub-trees that correspond to homologous Web table data regions have similar structures. Motivated by these observations, this paper proposes a data region extraction method based on the top-down tree edit distance, called EtractDRs. It uses the tree edit distance to measure the similarity of tree structures, merges structures whose edit distances fall below a pre-specified threshold to form candidate table data regions, and applies heuristic rules to obtain the final data regions. Experiments on table data from 25 Web sites show that, compared with the state-of-the-art MDR algorithm, which uses the string edit distance, our algorithm improves recall and the F value by up to 39.4% and 26.15%, respectively, while also maintaining better accuracy.
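The core idea of measuring sub-tree similarity and merging similar adjacent sub-trees can be illustrated with a minimal Python sketch. It is not the EtractDRs implementation: the `Node` class, the unit edit costs, the normalization by combined tree size, the default threshold of 0.3, and the restriction to adjacent siblings are all assumptions made for illustration, and the heuristic rules that select the final data regions are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A parse-tree node (hypothetical minimal structure: tag plus children)."""
    tag: str
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def top_down_distance(a, b):
    """Top-down tree edit distance with unit costs (an assumption).

    Roots are always matched (cost 1 if their tags differ). Children are
    aligned like a string edit distance, where deleting or inserting a
    child costs the size of its whole subtree and substituting a child
    costs the recursive distance between the two subtrees.
    """
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                dp[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


def normalized_distance(a, b):
    """Distance scaled by combined tree size (assumed normalization)."""
    return top_down_distance(a, b) / (size(a) + size(b))


def candidate_regions(parent, threshold=0.3):
    """Group runs of adjacent children whose pairwise distance is within
    the threshold; each run of two or more is a candidate data region."""
    regions, current = [], []
    kids = parent.children
    for prev, nxt in zip(kids, kids[1:]):
        if normalized_distance(prev, nxt) <= threshold:
            if not current:
                current = [prev]
            current.append(nxt)
        else:
            if len(current) >= 2:
                regions.append(current)
            current = []
    if len(current) >= 2:
        regions.append(current)
    return regions


# Toy page: one header block followed by three structurally similar rows.
header = Node("div", [Node("h1", [Node("a"), Node("a"), Node("a"), Node("a")])])
r1 = Node("tr", [Node("td"), Node("td")])
r2 = Node("tr", [Node("td"), Node("td")])
r3 = Node("tr", [Node("td"), Node("td"), Node("td")])
page = Node("body", [header, r1, r2, r3])

regions = candidate_regions(page)
```

On this toy page the three `tr` sub-trees end up in a single candidate region, while the structurally different header block is left out. Charging whole-subtree sizes for insertions and deletions is what makes the distance "top-down": a node can only be matched if its parent is matched too.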
