A Generalized Links and Text Properties Based Forum Crawler

Web forums have become a major source for information gathering and mining because of their large volume of user-generated content, and crawling them is the necessary first step. However, a generic web crawler cannot crawl web forums efficiently or effectively, because forums contain many redundant and duplicate pages. In addition, a crawling relationship exists among the useful pages, and the crawler needs to take it into account. Efficient forum crawling therefore requires eliminating redundant and duplicate pages and understanding this crawling relationship. Existing forum crawling methods rely on visual pattern recognition, which makes them extremely computationally expensive. In this paper, we propose a novel lightweight crawling method that uses the text and link properties of the pages in web forums. Theoretical analysis and experimental results show the effectiveness and efficiency of the proposed method.
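To make the duplicate-page problem concrete, the following minimal sketch shows one common way in which a link property alone can expose duplicates: URLs that differ only in session or sorting parameters are mapped to the same canonical form. This is an illustration under stated assumptions, not the method proposed in this paper; the parameter names in `NOISE_PARAMS`, the helper names `canonicalize` and `dedupe`, and the example URLs are all hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that do not change the page content.
# Real forums use different names; this set is purely illustrative.
NOISE_PARAMS = {"sid", "sessionid", "phpsessid", "sort", "order", "highlight"}


def canonicalize(url: str) -> str:
    """Return a canonical form of the URL for duplicate detection."""
    parts = urlsplit(url)
    # Drop noise parameters and sort the rest so parameter order is irrelevant.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    )
    # Lowercase scheme and host, and discard the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def dedupe(urls):
    """Keep the first URL seen for each canonical form."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

In this sketch, `http://Forum.example.com/viewtopic.php?t=42&sid=abc#p7` and `http://forum.example.com/viewtopic.php?sid=xyz&t=42` reduce to the same canonical URL, so only one of them would be fetched. A URL-only check of this kind is cheap but cannot detect duplicates that share no URL structure, which is where text properties would have to be considered.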
