Incorporating site-level knowledge to extract structured data from web forums

Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages remains challenging because of both complex page layouts and unrestricted user-created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our goal is a solution general enough to extract structured data, such as post title, post author, post time, and post content, from any forum site. In contrast to most existing information extraction methods, which leverage only the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to integrate all useful evidence, learning the importance of each piece automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. Experimental results on 20 forums show encouraging extraction performance and demonstrate that the proposed approach works across various forums. We also show that performance is limited when only page-level knowledge is used, whereas incorporating site-level knowledge significantly improves both precision and recall.
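To make the idea of weighting evidence concrete, the sketch below uses the standard log-linear form of a Markov logic network, in which the probability of a world x is proportional to exp(sum_i w_i * n_i(x)), where n_i(x) counts the satisfied groundings of formula i. It is only an illustration under stated assumptions: the feature names, the weights, and the two candidate worlds are hypothetical and are not taken from the paper, and the worlds are enumerated by hand rather than inferred.

```python
import math

def world_probabilities(weights, counts_by_world):
    """Return P(x) = exp(sum_i w_i * n_i(x)) / Z over explicitly listed worlds.

    weights: formula name -> weight (hypothetical values here; an MLN learns them).
    counts_by_world: world name -> {formula name: number of satisfied groundings}.
    """
    scores = {
        world: sum(weights[name] * n for name, n in counts.items())
        for world, counts in counts_by_world.items()
    }
    z = sum(math.exp(s) for s in scores.values())  # partition function
    return {world: math.exp(s) / z for world, s in scores.items()}

# Hypothetical formulas: one page-level cue and one site-level cue.
weights = {
    "page_looks_like_heading": 1.0,       # evidence from inside the post page
    "site_matches_list_link_text": 2.0,   # evidence from the linking list page
}

# Two candidate worlds: DOM node A or DOM node B is the post title.
page_only = {
    "node_a_is_title": {"page_looks_like_heading": 1},
    "node_b_is_title": {"page_looks_like_heading": 1},
}
page_and_site = {
    "node_a_is_title": {"page_looks_like_heading": 1, "site_matches_list_link_text": 1},
    "node_b_is_title": {"page_looks_like_heading": 1, "site_matches_list_link_text": 0},
}

p_page = world_probabilities(weights, page_only)
p_both = world_probabilities(weights, page_and_site)
```

In this toy setting the page-level cue alone cannot separate the two candidates, while the additional site-level cue shifts probability toward node A. A real MLN would learn the weights from labeled pages and run inference over far more ground atoms than can be enumerated this way.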
