Topic-Specific Crawling on the Web with Concept Context Graph Based on FCA

Topic-specific crawling does not attempt to crawl every web page; it crawls only the pages related to users' interests, and the pages most relevant to those interests should be crawled first. The major problem in focused crawling is how to assign proper credit to the unvisited pages that the crawler will visit. In this paper, we propose an effective approach that uses a concept context graph based on Formal Concept Analysis (FCA) to solve this problem. We build a concept lattice from the visited pages, and then use a term-combination method to construct our concept context graph based on the upper concept lattice. Our crawler can measure a page's expected relevancy to a given topic and determine the order in which pages should be visited. An experiment illustrates that the new method is an effective mechanism that achieves considerable results.
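The general idea can be illustrated with a minimal sketch, which is not the implementation evaluated in this paper. It derives the formal concepts of a small page-term context and then orders unvisited links by an estimated relevancy. The scoring rule (best Jaccard overlap with any concept intent that mentions a topic term), the function names `formal_concepts`, `expected_relevance` and `crawl_order`, and the toy data are all assumptions made for illustration; the concept context graph itself is not reproduced here.

```python
import heapq


def formal_concepts(context):
    """Return the formal concepts (extent, intent) of a page -> terms context.

    Every intent is either the full term set or an intersection of page
    term sets, so the intents are collected by repeated intersection.
    """
    all_terms = frozenset().union(*context.values())
    intents = {all_terms}
    for terms in context.values():
        page_terms = frozenset(terms)
        intents |= {page_terms & intent for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is the set of pages containing every term of the intent.
        extent = frozenset(p for p, t in context.items() if intent <= t)
        concepts.append((extent, intent))
    return concepts


def expected_relevance(link_terms, concepts, topic_terms):
    """Assumed score: best Jaccard overlap with a topic-bearing intent."""
    best = 0.0
    for _extent, intent in concepts:
        if intent & topic_terms:
            overlap = len(intent & link_terms) / len(intent | link_terms)
            best = max(best, overlap)
    return best


def crawl_order(links, concepts, topic_terms):
    """Return link URLs sorted from highest to lowest assumed relevancy."""
    heap = [(-expected_relevance(terms, concepts, topic_terms), url)
            for url, terms in links.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Toy data (hypothetical): visited pages and the terms found on them.
visited = {
    "p1": {"crawler", "web", "topic"},
    "p2": {"crawler", "web"},
    "p3": {"lattice", "concept"},
}
# Unvisited links described by hypothetical anchor-text terms.
frontier = {
    "a": {"crawler", "web"},
    "b": {"lattice"},
    "c": {"topic", "crawler", "news"},
}
concepts = formal_concepts(visited)
order = crawl_order(frontier, concepts, topic_terms={"crawler"})
```

In this toy setting the link whose terms coincide with a topic-bearing concept intent is placed at the front of the queue, which mirrors the goal of visiting the most relevant pages first.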
