Topical web crawling using weighted anchor text and web page change detection techniques

In this paper, we discuss focused web crawling, the relevance of anchor text, and a method for web page change detection for search engines. We propose a technique called weighted anchor text, which uses the link structure to form a weighted directed graph of anchor texts. A weight is assigned to every incoming link of a node in the directed graph, and these weights determine the relevance of web pages: pages are indexed in decreasing order of the weights assigned to them. We applied our algorithm to various websites and observed the results, and we deduce that the algorithm can be very useful when combined with other existing algorithms. Web usage has increased exponentially in the past few years, and the resulting enormous collection of web pages changes rapidly, with the degree of change varying from site to site. We discuss the relevance of change detection and then explore related work in the area. Based on this understanding, we propose a new algorithm to map changes in a web page. After verifying the results on various web pages, we observe the relative merits of the proposed algorithm.
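The weighted anchor text idea can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything specific here is an assumption made for illustration: the function names `link_weight` and `rank_pages`, the example URLs, and the rule that a link's weight is the number of topic terms appearing in its anchor text. Only the overall shape (weights on incoming links of a directed graph, pages ordered by decreasing weight) comes from the text.

```python
from collections import defaultdict


def link_weight(anchor_text, topic_terms):
    """Hypothetical weighting rule (not from the paper):
    count the anchor words that are topic terms."""
    words = anchor_text.lower().split()
    return sum(1 for w in words if w in topic_terms)


def rank_pages(links, topic_terms):
    """Rank target pages by the total weight of their incoming links.

    links: iterable of (source_url, target_url, anchor_text) edges,
    i.e. the directed graph of anchor texts.
    Returns (url, weight) pairs in decreasing order of weight,
    with ties broken alphabetically by URL.
    """
    incoming = defaultdict(int)
    for _source, target, anchor in links:
        incoming[target] += link_weight(anchor, topic_terms)
    return sorted(incoming.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data only; these URLs and terms are made up.
links = [
    ("a.example", "b.example", "web crawler tutorial"),
    ("a.example", "c.example", "contact us"),
    ("c.example", "b.example", "focused crawler"),
    ("b.example", "d.example", "crawler news"),
]
topic_terms = {"web", "crawler", "focused"}

ranked = rank_pages(links, topic_terms)
for url, weight in ranked:
    print(url, weight)
```

In this sketch, a page reached through several topic-heavy anchors rises to the front of the indexing order, while a page linked only by off-topic anchors such as "contact us" falls to the end. A real crawler would likely combine such a score with other signals, as the abstract suggests.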
