Among HTML elements, HTML tables [RHJ98] encapsulate hierarchically structured data (hierarchical data, for short) in tabular form. HTML tables do not come with a rigid schema, and almost any form of two-dimensional table is acceptable under the HTML grammar. This flexibility complicates the retrieval of hierarchical data from HTML tables. In this paper, we propose an automated approach to this retrieval problem. The approach constructs the content tree of an HTML table, which captures the intended hierarchy of the table's data content, without requiring the internal structure of the table to be known beforehand. Moreover, a user of the content tree need not deal with HTML tags when retrieving the desired data. Our approach can be employed by (i) a query language designed for retrieving hierarchically structured data, whether extracted from HTML tables or from other sources, (ii) a processor that converts HTML tables to XML documents, and (iii) a data warehousing repository that collects hierarchical data from HTML tables and stores materialized views of the tables. The time complexity of the proposed retrieval approach is linear in the number of HTML elements in the table.
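The abstract does not spell out how the content tree is constructed, so the following is only a minimal sketch of the general idea and not the algorithm proposed in the paper. It assumes a well-formed table with explicit closing tags, no nested tables, no colspan, and a layout in which each row, read left to right, is a root-to-leaf path, with rowspan cells shared by the rows they cover. The names Node, TableTreeBuilder, build_content_tree, and SAMPLE are hypothetical and introduced here for illustration. Each tag is visited once, so the work grows linearly with the number of elements.

```python
from html.parser import HTMLParser


class Node:
    """One node of the illustrative content tree: a label plus ordered children."""

    def __init__(self, label):
        self.label = label
        self.children = {}  # child label -> Node (dicts preserve insertion order)

    def child(self, label):
        # Reuse the child with this label if present, otherwise create it.
        if label not in self.children:
            self.children[label] = Node(label)
        return self.children[label]

    def to_dict(self):
        return {label: node.to_dict() for label, node in self.children.items()}


class TableTreeBuilder(HTMLParser):
    """Single pass over the tags; every table row becomes a root-to-leaf path."""

    def __init__(self):
        super().__init__()
        self.root = Node("table")
        self.carry = {}   # column -> (label, rows still covered after the current one)
        self.row = None   # column -> label for the row being read
        self.col = 0
        self.text = None  # text buffer for the cell being read
        self.span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Cells spanning down from earlier rows pre-fill their columns.
            self.row = {c: label for c, (label, _) in self.carry.items()}
            self.carry = {c: (label, n - 1)
                          for c, (label, n) in self.carry.items() if n > 1}
            self.col = 0
        elif tag in ("td", "th") and self.row is not None:
            self.text = []
            self.span = int(dict(attrs).get("rowspan") or 1)

    def handle_data(self, data):
        if self.text is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.text is not None:
            while self.col in self.row:  # skip columns taken by spanning cells
                self.col += 1
            label = " ".join("".join(self.text).split())
            self.row[self.col] = label
            if self.span > 1:
                self.carry[self.col] = (label, self.span - 1)
            self.col += 1
            self.text = None
        elif tag == "tr" and self.row is not None:
            node = self.root
            for c in sorted(self.row):  # left-to-right cells form the path
                node = node.child(self.row[c])
            self.row = None


def build_content_tree(html):
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Made-up sample table, used only to exercise the sketch.
SAMPLE = """
<table>
<tr><td rowspan="2">Fruit</td><td>Apple</td><td>1.20</td></tr>
<tr><td>Pear</td><td>0.90</td></tr>
<tr><td>Vegetable</td><td>Leek</td><td>2.10</td></tr>
</table>
"""

if __name__ == "__main__":
    print(build_content_tree(SAMPLE).to_dict())
```

Once the tree is built, a caller navigates it by label alone, for example build_content_tree(SAMPLE).child("Fruit").child("Pear"), and never touches the markup. This mirrors the claim that users of the content tree need not deal with HTML tags.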