Using a Hidden Markov Model to Learn User Browsing Patterns for Focused Web Crawling

A focused crawler is designed to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. A focused crawler must use information gleaned from previously crawled page sequences to estimate the relevance of a newly seen URL. In this paper, we present a new approach for predicting which links lead to relevant pages, based on a Hidden Markov Model (HMM). The system consists of three stages: user data collection, user modelling via sequential pattern learning, and focused crawling. In particular, we first collect the Web pages visited during a user browsing session. These pages are clustered, and the link structure among pages from different clusters is used to learn which page sequences are likely to lead to target pages. This learning is done using an HMM. During crawling, links are prioritized by a learned estimate of how likely the page they point to is to lead to a target page. We compare performance with Context-Graph crawling and Best-First crawling, and experiments show that our approach outperforms both strategies.
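To make the pipeline concrete, the sketch below illustrates the crawling stage under stated assumptions; it is not the paper's implementation. It assumes pages have already been assigned cluster labels (the HMM observations), that hidden states encode "distance to a target page", and that the probability tables have already been learned from user browsing sessions (e.g. via Baum-Welch); the matrices shown here are invented placeholders. Frontier URLs are scored with the forward algorithm and crawled in priority order.

```python
# Minimal sketch (assumed parameters, not learned values) of HMM-guided
# link prioritization for a focused crawler.
import heapq
import numpy as np

# Hidden states: 0 = "target page", 1 = "one hop from target", 2 = "far away".
start = np.array([0.1, 0.3, 0.6])            # initial state distribution
trans = np.array([[0.7, 0.2, 0.1],           # state transition probabilities
                  [0.5, 0.3, 0.2],
                  [0.1, 0.4, 0.5]])
# Emission probabilities P(observed cluster label | hidden state),
# assuming 4 page clusters produced by an earlier clustering step.
emit = np.array([[0.6, 0.2, 0.1, 0.1],
                 [0.2, 0.4, 0.2, 0.2],
                 [0.1, 0.2, 0.3, 0.4]])

def forward(obs_seq):
    """Forward algorithm: posterior over hidden states given the
    sequence of cluster labels observed along the path to a URL."""
    alpha = start * emit[:, obs_seq[0]]
    for o in obs_seq[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha / alpha.sum()

def link_priority(obs_seq):
    """Score a frontier URL by the probability that its path is
    heading toward a target page (states 0 or 1)."""
    posterior = forward(obs_seq)
    return posterior[0] + posterior[1]

# Crawl frontier as a max-priority queue (heapq is a min-heap, so negate).
frontier = []
for url, path_clusters in [("http://example.org/a", [3, 2, 1]),
                           ("http://example.org/b", [3, 3, 3])]:
    heapq.heappush(frontier, (-link_priority(path_clusters), url))

while frontier:
    score, url = heapq.heappop(frontier)
    print(f"crawl {url} (priority {-score:.3f})")
```

In this toy run, the URL whose path drifts through clusters closer to the target (here `[3, 2, 1]`) is popped first, which is the behaviour the abstract describes: the crawler follows the links whose browsing context most resembles learned paths to target pages.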
