Data extraction from web pages based on structural-semantic entropy

Most of today's web content is designed for human consumption, which makes it difficult for software tools to access it readily. Even web content that is automatically generated from back-end databases is usually presented without the original structural information. In this paper, we present an automated information extraction algorithm that can extract the relevant attribute-value pairs from product descriptions across different sites. To locate the data of interest on web pages, we use a notion called structural-semantic entropy, which measures the density of occurrence of relevant information on the DOM tree representation of web pages. Our approach is less labor-intensive and is insensitive to changes in web-page format. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach. Because it can associate the attributes of records with their respective values, the approach has also been successfully applied to detect false drug advertisements on the web.
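
The abstract does not give the formula for structural-semantic entropy, so the following is only a minimal sketch of one plausible reading: the Shannon entropy of the distribution of semantic roles found among the leaf nodes under a DOM node. The `Node` class, the `ROLE_KEYWORDS` lexicon, and the function names are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node (not a real DOM API)."""
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


# Assumed keyword lexicon mapping a semantic role to trigger strings.
ROLE_KEYWORDS = {
    "price": ("price", "$"),
    "brand": ("brand", "manufacturer"),
    "weight": ("weight", "oz"),
}


def leaf_roles(node: Node) -> list[str]:
    """Collect the semantic roles matched by the leaf nodes under `node`."""
    if not node.children:
        text = node.text.lower()
        return [role for role, keywords in ROLE_KEYWORDS.items()
                if any(k in text for k in keywords)]
    roles: list[str] = []
    for child in node.children:
        roles.extend(leaf_roles(child))
    return roles


def structural_semantic_entropy(node: Node) -> float:
    """Shannon entropy (in bits) of the role distribution under `node`."""
    counts = Counter(leaf_roles(node))
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no relevant information under this node
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A navigation bar carries no product attributes; a record carries several.
nav = Node("ul", children=[Node("li", "Home"), Node("li", "About")])
record = Node("div", children=[
    Node("span", "Price: $9.99"),
    Node("span", "Brand: Acme"),
    Node("span", "Weight: 12 oz"),
])
page = Node("body", children=[nav, record])

scores = {n.tag: structural_semantic_entropy(n) for n in (nav, record, page)}
```

Under these assumptions the navigation list scores zero, while the record node scores high because several distinct roles occur beneath it. The page root scores the same as the record, so a practical extractor would need an additional rule, for example preferring the deepest node with a high score, and that rule is likewise an assumption here.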
