Tree-Structured Template Generation for Web Pages

As the web becomes an increasingly important source of information, tools for modeling, searching, and extracting information from Web pages are indispensable. By modeling the structure of a Web page defined by its markup tags, one can easily extract target information using structural templates. This paper introduces the Tree Template Automatic Generator (TTAG) that learns tree-structured templates from training Web pages. TTAG was applied to both query-based and frequently updated Web sites, and produced effective templates from a small number of examples. The experiments show that TTAG is a powerful extraction tool for semi-structured information sources.

[1]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[2]  Craig A. Knoblock,et al.  Semi-automatic wrapper generation for Internet information sources , 1997, Proceedings of CoopIS 97: 2nd IFCIS Conference on Cooperative Information Systems.

[3]  Robert L. Grossman,et al.  Mining data records in Web pages , 2003, KDD '03.

[4]  Maurice Bruynooghe,et al.  Information Extraction in Structured Documents Using Tree Automata Induction , 2002, PKDD.

[5]  Craig A. Knoblock,et al.  Hierarchical Wrapper Induction for Semistructured Information Sources , 2004, Autonomous Agents and Multi-Agent Systems.

[6]  Jane Yung-jen Hsu,et al.  Template-Based Information Mining from HTML Documents , 1997, AAAI/IAAI.

[7]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[8]  Kaizhong Zhang,et al.  Approximate Tree Matching in the Presence of Variable Length Don't Cares , 1994, J. Algorithms.

[9]  Andrzej Skowron,et al.  Proceedings of the 2005 IEEE / WIC / ACM International Conference on Web Intelligence , 2005 .

[10]  Kam-Fai Wong,et al.  KPS: a Web Information Mining Algorithm , 1999, Comput. Networks.

[11]  Calton Pu,et al.  A fully automated object extraction system for the World Wide Web , 2001, Proceedings 21st International Conference on Distributed Computing Systems.

[12]  David W. Embley,et al.  Record-boundary discovery in Web documents , 1999, SIGMOD '99.

[13]  Chun-Nan Hsu,et al.  Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web , 1998, Inf. Syst..

[14]  Chia-Hui Chang,et al.  IEPAD: information extraction based on pattern discovery , 2001, WWW '01.

[15]  Tobias Dönz Extracting Structured Data from Web Pages , 2003 .

[16]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[17]  Yalin Wang,et al.  A machine learning based approach for table detection on the web , 2002, WWW '02.