Extraction Rule Language for Web Information Extraction and Integration

The Web is the largest data source and contains a great deal of information that is valuable to users and applications. Automatically navigating web pages and extracting useful data from them therefore remains an important problem. A number of studies have addressed this area, but most of them do not consider the complete process of web information extraction, and they lack rules expressive enough to describe navigation, extraction and integration. In this paper, we propose a new web information extraction rule language as a step toward a general model for web information extraction and integration. We first introduce source data objects to extract different types of web data records. We then adopt XML to define the structure of the target data entities and use scripts to integrate the target data records. The results show that our extraction rule language provides a powerful and flexible way to describe extraction logic and achieves accurate extraction of web data records from complex web pages.
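To make the idea of an XML-defined target entity combined with an integration script more concrete, the following minimal Python sketch may help. It is not the rule language proposed in the paper: the element and attribute names (`entity`, `field`, `key`, `source`, `type`), the helper functions `load_entity` and `integrate`, and the sample records are all hypothetical, chosen only to illustrate how source data records might be mapped onto a declared target structure and merged by key.

```python
import xml.etree.ElementTree as ET

# Hypothetical target entity definition. The element and attribute names
# are illustrative assumptions, not the paper's actual rule syntax.
TARGET_ENTITY_XML = """
<entity name="Product" key="title">
  <field name="title" source="name"/>
  <field name="price" source="cost" type="float"/>
</entity>
"""

# Supported type conversions for target fields (assumed for this sketch).
CASTS = {"str": str, "int": int, "float": float}


def load_entity(xml_text):
    """Parse the entity definition into (name, key, field mappings)."""
    root = ET.fromstring(xml_text)
    fields = [
        (f.get("name"), f.get("source"), CASTS[f.get("type", "str")])
        for f in root.findall("field")
    ]
    return root.get("name"), root.get("key"), fields


def integrate(source_records, xml_text):
    """Map source records onto the target entity and merge them by key."""
    _, key, fields = load_entity(xml_text)
    merged = {}
    for record in source_records:
        target = {}
        for name, source, cast in fields:
            if source in record:
                target[name] = cast(record[source])
        if key not in target:
            continue  # skip records that lack the key field
        # Later records with the same key overwrite earlier field values.
        merged.setdefault(target[key], {}).update(target)
    return list(merged.values())


# Stand-ins for records already extracted from web pages.
source_records = [
    {"name": "Widget", "cost": "9.50"},
    {"name": "Gadget", "cost": "12"},
    {"name": "Widget", "cost": "9.75"},
]
result = integrate(source_records, TARGET_ENTITY_XML)
```

In this sketch the declarative part (the XML) fixes what a target record looks like, while the script decides how duplicates are reconciled; here the last value seen for a key simply wins, which is only one of many possible integration policies.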
