Generalized and lightweight algorithms for automated web forum content extraction

Online forums contain a vast amount of information that can aid in the early detection of fraud and extremist activities, so accurate and efficient information extraction from forum sites is very important. In this paper, we discuss the limitations of existing work on information extraction from generic web sites and forum sites, and we identify the need for better-suited, generalized and lightweight algorithms that extract information more accurately and efficiently while eliminating noisy data from forum sites. We propose three generalized and lightweight algorithms for accurate thread and post content extraction from web forums. We evaluate our algorithms against two strict criteria, at the granularity of DOM tree node-level correctness. We consider a thread or post as successfully extracted only if (i) all the contents in its text and anchor nodes are extracted correctly, and (ii) each content node is grouped correctly under its respective thread or post. Our experiments on ten different forum sites show that our proposed thread extraction algorithm achieves an average recall and precision of 100% and 98.66%, respectively, while our core post extraction algorithm achieves an average recall and precision of 99.74% and 99.79%, respectively.
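The two strict criteria can be illustrated with a minimal sketch. This is not the evaluation code used in the paper. It assumes the standard definitions of precision (correct records over extracted records) and recall (correct records over ground-truth records), and it models each thread or post as the set of its text and anchor nodes. The names `Node`, `Record` and `strict_precision_recall` are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

# Assumption: a content node is a (node_type, content) pair, where node_type
# is "text" or "anchor". A record is one thread or post, i.e. the group of
# content nodes that belong to it.
Node = Tuple[str, str]
Record = FrozenSet[Node]


def strict_precision_recall(
    extracted: List[Record], truth: List[Record]
) -> Tuple[float, float]:
    """Score extraction at record level under strict matching.

    A record counts as correct only if it equals a ground-truth record
    exactly: every text/anchor node is present (criterion i) and no node
    from another record or from page noise is grouped into it (criterion ii).
    """
    # Multiset intersection, so two identical posts are matched one-to-one.
    matched = Counter(extracted) & Counter(truth)
    correct = sum(matched.values())
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth: two posts on one page.
truth = [
    frozenset({("text", "Hi"), ("anchor", "/user/1")}),
    frozenset({("text", "Bye")}),
]
# The second extracted post wrongly absorbs a noise node, so it fails.
extracted = [
    frozenset({("text", "Hi"), ("anchor", "/user/1")}),
    frozenset({("text", "Bye"), ("text", "Advert")}),
]
precision, recall = strict_precision_recall(extracted, truth)
```

In this example only one of the two extracted posts matches exactly, so both precision and recall are 0.5. A single missing or misgrouped node is enough to mark the whole record as a failure, which is what makes the criteria strict.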
