A Novel Method for Extracting Information from Web Pages with Multiple Presentation Templates

Web information extraction is a key part of web data integration. Driven by the needs of e-commerce websites and advances in web design, web pages that use multiple presentation templates have become common. Current web information extraction systems usually assume a single presentation template, so they cannot extract data from such pages efficiently. This paper focuses on the problem of extracting information from web pages with multiple presentation templates. Four variants of the problem are considered, and a novel method based on path entropy, presentation regularity, and ontology knowledge is presented. Experiments indicate that the method is promising and achieves high recall and precision.
