Extracting structures of HTML documents

Information on the Web is a conglomeration of heterogeneous data, such as text, images, and audio clips, and is often accessed through documents written according to the HTML specification. Documents that follow this specification are semistructured in nature. We propose a high-level stack machine (HSM) that accesses an HTML document through its URL and constructs a semistructured data graph (SDG) of the document. The SDG of an HTML document H precisely captures the structure of the semistructured data embedded in H, based on the dependency relationships among the data objects in H. HSM is configurable: a user can specify which HTML elements in H are to be considered during the construction of the SDG of H.
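The general idea of a stack-driven, configurable construction can be illustrated with a minimal sketch. This is not the HSM described above. It assumes the document is already available as a string (fetching by URL is omitted), it models dependencies only as element nesting (so the result is a tree, not a general graph), and the names `Node`, `StackGraphBuilder`, `build_graph`, and `tags_of_interest` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One data object in the illustrative structure."""

    def __init__(self, label, text=""):
        self.label = label      # element name, or "#text" for character data
        self.text = text        # non-empty only for "#text" nodes
        self.children = []      # objects that depend on (are nested in) this one


class StackGraphBuilder(HTMLParser):
    """Builds a nesting tree using an explicit stack of open elements.

    Only elements named in tags_of_interest (an assumed configuration
    parameter) become nodes; all other tags are skipped, and their text
    attaches to the nearest enclosing element of interest.
    """

    def __init__(self, tags_of_interest):
        super().__init__()
        self.tags = set(tags_of_interest)
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return
        # Pop back to the nearest matching open element, discarding any
        # still-open descendants. An unmatched end tag is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def build_graph(html, tags_of_interest):
    """Parse an HTML string and return the root of the nesting tree."""
    builder = StackGraphBuilder(tags_of_interest)
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><ul><li>a</li><li>b</li></ul></body></html>"
    root = build_graph(page, {"h1", "ul", "li"})
    for child in root.children:
        print(child.label, [c.label for c in child.children])
```

Changing `tags_of_interest` changes which elements appear as nodes, which mirrors the configurability described in the abstract only in spirit.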
