Automatically mining review records from forum Web sites

The rapid development of Web 2.0 has brought a flourishing of web reviews. Web reviews are usually published in the form of structured records. As an important information source for many popular applications (e.g., monitoring and analysis of public opinion), review records need to be extracted accurately from web pages. To the best of our knowledge, little work in the literature has systematically investigated this problem. Besides the variety of web page templates, user-generated review content raises a new challenge. The inconsistency of review content, in both the DOM tree and the visual appearance, impairs the similarity among review records, which seriously degrades the performance of existing solutions for web data record extraction. To tackle this challenge, we propose a novel approach that performs automatic extraction of review records by employing sophisticated techniques. Our experimental results over 20 forum web sites indicate that the proposed approach can achieve high extraction accuracy.
