Learning and Discovering Structure in Web Pages

Because much of the information on the web is presented in a regular, repeated format, “understanding” web pages often requires recognizing and using structure, where structure is typically defined by hyperlinks between pages and by HTML formatting commands within a page. We survey some of the ways in which structure within a web page can be used to help machines understand pages. Specifically, we review past research on techniques that automatically learn and discover web-page structure. These techniques are central to wrapper learning, an active research area, and can also benefit tasks as diverse as classification of entities mentioned on the web, collaborative filtering for music, web page classification, and entity extraction from web pages.
