Multi Keyword Web Crawling Using Ontology in Web Forums

Internet forums are online discussion sites where people hold conversations in the form of posted messages. Each forum contains sub-forums, and each sub-forum contains topics that arise from users' discussions. Crawling is the first and most important step in the Web search process. The existing system is a supervised web-scale forum crawler called Forum Crawler Under Supervision (FoCUS), whose goal is to collect forum pages with minimum overhead. While crawling, the existing system uses only a single keyword to select web pages. It also neither discovers new threads nor refreshes already-crawled threads in a timely manner. The proposed system addresses these limitations by using an ontology for multi-keyword web crawling and a temporal database for discovering new threads. The proposed crawler collects web pages from the Web for indexing, and the use of an ontology is expected to increase both crawling efficiency and page coverage.
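As a rough illustration of the two ideas named above, the following minimal Python sketch expands a multi-keyword query through a toy ontology, scores a page against the expanded terms, and uses a stored crawl timestamp to decide whether a thread should be revisited. The ontology contents, the function names (`expand_keywords`, `relevance`, `needs_refresh`), the scoring rule, and the six-hour refresh interval are all assumptions made for illustration; they are not taken from the proposed system.

```python
import re
from datetime import datetime, timedelta

# Hypothetical toy ontology: each keyword maps to a set of related terms.
ONTOLOGY = {
    "laptop": {"notebook", "ultrabook"},
    "battery": {"charger", "power"},
}


def expand_keywords(keywords, ontology=ONTOLOGY):
    """Return the query keywords plus their related ontology terms."""
    expanded = set()
    for keyword in keywords:
        keyword = keyword.lower()
        expanded.add(keyword)
        expanded |= ontology.get(keyword, set())
    return expanded


def relevance(text, terms):
    """Fraction of the expanded terms that occur in the page text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & terms) / len(terms) if terms else 0.0


def needs_refresh(last_crawled, now, max_age=timedelta(hours=6)):
    """A thread is due if it was never crawled or its record is too old."""
    return last_crawled is None or now - last_crawled >= max_age


terms = expand_keywords(["Laptop", "battery"])
score = relevance("My notebook charger died", terms)
due = needs_refresh(None, datetime(2024, 1, 1, 12, 0))
```

Here a page that mentions only "notebook" and "charger" still receives a nonzero score, even though neither word was in the original query, which is the coverage benefit that keyword expansion is meant to provide. The timestamp check stands in for a lookup against a temporal database.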
