A Lightweight Algorithm for Automated Forum Information Processing

The wide variety of information on Web forums makes them a valuable resource for purposes such as scam detection, national security protection, and sentiment analysis. However, extracting useful information from Web forums accurately and efficiently is challenging. First, Web forums contain several page types, and each type presents content in a different format. Second, the content on forum pages is organized into data blocks, and the information is meaningful only if the relevant data blocks are extracted separately. Generic content extraction systems fall short on both counts: they can neither distinguish among the page types nor extract information at the required granularity. Although several content extraction methods exist for Web forums, they either do not satisfy these requirements or rely on heuristics-based approaches (such as assumptions about standard visual appearance), which limits their applicability across different varieties of forum. In this paper, we propose a general and efficient content extraction method that uses the properties of the links present in forum pages. Experimental results demonstrate the effectiveness of the proposed method.
