Exploiting content redundancy for web information extraction

We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.
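The level-wise enumeration described above can be illustrated with a minimal sketch. It is not the authors' algorithm, only an Apriori-style search under simplifying assumptions: each page is a hypothetical mapping from a position string (for example, a path in the page template) to the text found there, a match means exact string equality with a seed attribute value, and a configuration's support is the number of pages on which a single seed record matches every (attribute, position) pair in it. All names (`page_matches`, `frequent_configurations`, `min_support`) and the sample data are invented for illustration.

```python
from itertools import combinations


def page_matches(page, seed_records):
    """Return, for one page, a set of (attribute, position) matches per seed record."""
    matches = []
    for record in seed_records:
        pairs = frozenset(
            (attr, pos)
            for pos, text in page.items()
            for attr, value in record.items()
            if text == value  # assumption: exact match; real data would need fuzzy matching
        )
        if pairs:
            matches.append(pairs)
    return matches


def support(config, matches_per_page):
    """Count pages where one seed record matches every pair in the configuration."""
    return sum(
        any(config <= pairs for pairs in page_sets)
        for page_sets in matches_per_page
    )


def is_valid(config):
    """Each attribute maps to one position and each position holds one attribute."""
    attrs = [a for a, _ in config]
    positions = [p for _, p in config]
    return len(set(attrs)) == len(attrs) and len(set(positions)) == len(positions)


def frequent_configurations(pages, seed_records, min_support):
    """Enumerate configurations level by level, pruning with the Apriori property."""
    matches_per_page = [page_matches(p, seed_records) for p in pages]
    items = {pair for sets in matches_per_page for pairs in sets for pair in pairs}

    # Level 1: single (attribute, position) pairs with enough support.
    level = {}
    for item in items:
        config = frozenset([item])
        count = support(config, matches_per_page)
        if count >= min_support:
            level[config] = count

    result = {}
    size = 1
    while level:
        result.update(level)
        size += 1
        # Join step: merge frequent configurations that differ in exactly one pair.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        next_level = {}
        for config in candidates:
            # Prune step: every sub-configuration must itself be frequent.
            if not is_valid(config):
                continue
            if not all(frozenset(s) in level for s in combinations(config, size - 1)):
                continue
            count = support(config, matches_per_page)
            if count >= min_support:
                next_level[config] = count
        level = next_level
    return result


# Hypothetical seed database and pages from one new template-based site.
SEED = [
    {"name": "Blue Bistro", "phone": "555-0100"},
    {"name": "Red Cafe", "phone": "555-0111"},
]
PAGES = [
    {"/h1": "Blue Bistro", "/div[2]": "555-0100"},
    # "/span[5]" is a noisy match: another record's name appears in a side link.
    {"/h1": "Red Cafe", "/div[2]": "555-0111", "/span[5]": "Blue Bistro"},
    {"/h1": "Green Grill", "/div[2]": "555-0133"},  # not in the seed database
]

if __name__ == "__main__":
    found = frequent_configurations(PAGES, SEED, min_support=2)
    for config, count in sorted(found.items(), key=lambda kv: len(kv[0])):
        print(sorted(config), count)
```

In this toy setting the stray name match at `/span[5]` occurs on only one page, so it never reaches the support threshold, while the pairing of `name` at `/h1` with `phone` at `/div[2]` survives. Support can only shrink as pairs are added to a configuration, which is what makes the subset-based pruning safe.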
