An architecture for a focused trend parallel Web crawler with the application of clickstream analysis

The tremendous growth of the Web poses many challenges for all-purpose single-process crawlers, including irrelevant answers among search results and coverage and scaling issues arising from the enormous size of the World Wide Web. Hence, more capable algorithms are in demand to yield more precise and relevant search results in a reasonable amount of time. Employing link-based Web page importance metrics within a multi-process crawler imposes considerable communication overhead on the overall system and still cannot produce a precise answer set, so these metrics alone are not a complete solution for identifying the best answer set. A link-independent Web page importance metric is therefore required to govern the priority rule within the queue of fetched URLs. The aim of this paper is to propose a modest weighted architecture for a focused structured parallel Web crawler that employs a link-independent, clickstream-based Web page importance metric. Experiments with this metric over the restricted-boundary Web zone of our crowded UTM University Web site show the efficiency of the proposed metric.
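
To illustrate how a link-independent metric could govern the priority rule of a URL queue, the following is a minimal sketch, not the architecture proposed in the paper. It assumes the importance of a page is simply its visit count in a server log given as `(session_id, url)` pairs; the names `click_counts` and `Frontier` are hypothetical, and the paper's actual metric may weight clicks differently.

```python
import heapq
from collections import Counter


def click_counts(log_entries):
    """Count visits per URL from (session_id, url) pairs (assumed log format)."""
    return Counter(url for _session, url in log_entries)


class Frontier:
    """URL queue that pops the URL with the highest clickstream score first."""

    def __init__(self, scores):
        self._scores = scores   # mapping: url -> visit count
        self._heap = []         # entries: (-score, insertion order, url)
        self._seen = set()      # URLs already queued, to avoid duplicates
        self._order = 0         # tie-breaker: earlier insertions pop first

    def push(self, url):
        if url in self._seen:
            return
        self._seen.add(url)
        score = self._scores.get(url, 0)  # unseen pages default to 0
        heapq.heappush(self._heap, (-score, self._order, url))
        self._order += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Because the score depends only on the log and not on the link graph, each crawler process could rank its own queue without exchanging link information with the others, which is the overhead the abstract attributes to link-based metrics.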
