PROBABILISTIC MODELS FOR FOCUSED WEB CRAWLING

A focused crawler is an efficient tool used to traverse the Web to gather documents on a specific topic. It can be used to build domain‐specific Web search portals and online personalized search tools. Focused crawlers can use only information obtained from previously crawled pages to estimate the relevance of a newly seen URL. Good performance therefore depends on powerful modeling of context as well as on the quality of the current observations. To address this challenge, we propose using probabilistic models to capture sequential patterns along paths leading to targets. We model the process of crawling as a walk along an underlying chain of hidden states, defined by hop distance from target pages, from which the actual topics of the documents are observed. When a new document is seen, prediction amounts to estimating the distance of this document from a target. Within this framework, we propose two probabilistic models for focused crawling: the Maximum Entropy Markov Model (MEMM) and the Linear‐chain Conditional Random Field (CRF). With MEMM, we exploit multiple overlapping features, such as anchor text, to represent useful context and form a chain of local classifier models. With CRF, a form of undirected graphical model, we focus on obtaining globally optimal solutions along the sequences by taking advantage not only of text content but also of linkage relations. We conclude with an experimental validation and a comparison with focused crawling based on Best‐First Search (BFS), Hidden Markov Model (HMM), and Context‐graph Search (CGS).
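
To make the distance-based framing concrete, the following is a minimal sketch of a crawl frontier ordered by estimated hop distance from a target. It assumes some model (an MEMM or a CRF, for instance) has already produced a probability distribution over hidden states for each newly seen URL. The `Frontier` class, the `expected_distance` helper, the example URLs, and the choice of expected distance as the priority are all illustrative assumptions; the abstract does not specify how distance estimates are converted into a crawl order.

```python
import heapq
from typing import List, Set, Tuple


def expected_distance(state_probs: List[float]) -> float:
    """Expected hop distance, where state_probs[d] = P(hidden state = d hops).

    Assumption: using the expectation as the priority is an illustrative
    choice, not necessarily the scoring rule used in the paper.
    """
    return sum(d * p for d, p in enumerate(state_probs))


class Frontier:
    """Hypothetical priority queue of unvisited URLs, closest-to-target first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seen: Set[str] = set()
        self._counter = 0  # tie-breaker so equal priorities pop in insertion order

    def push(self, url: str, state_probs: List[float]) -> None:
        # Skip URLs that have already been queued once.
        if url in self._seen:
            return
        self._seen.add(url)
        priority = expected_distance(state_probs)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        # Return the URL with the smallest expected distance to a target.
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


# Made-up distributions over three hidden states (0, 1, or 2 hops away).
frontier = Frontier()
frontier.push("http://example.org/a", [0.1, 0.2, 0.7])
frontier.push("http://example.org/b", [0.6, 0.3, 0.1])
frontier.push("http://example.org/c", [0.2, 0.6, 0.2])
order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

In this sketch the page whose distribution puts the most mass on small distances is visited first. A real crawler would re-estimate the distributions as new pages arrive, since each prediction depends on the path of previously crawled pages.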
