What's there and what's not?: focused crawling for missing documents in digital libraries

Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives and university and author homepages, and by accepting authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can end up holding only a portion of the documents from specific publishing venues. We propose using alternative online resources, and techniques that maximally exploit them, to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, covering a total of 593 unique authors over the period 1998 to 2004. We then identify the papers from these venues that are not indexed by CiteSeer. Using a fully automatic, heuristic-based system that locates authors' homepages and then uses focused crawling to download the desired papers, we demonstrate that it is practical to harvest, with a focused crawler, academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluated by harvest rate, the crawler shows definite advantages over other crawling approaches, and it consistently outperforms a defined baseline crawler on a number of measures.
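
The abstract reports results in terms of missing documents, recall, and harvest rate without defining them. The following is a minimal sketch that assumes the conventional definitions: missing documents are the venue's papers absent from the library index, recall is the fraction of wanted papers that the crawler retrieved, and harvest rate is the fraction of fetched pages judged relevant. The function names and the toy identifiers are hypothetical and are not taken from the paper.

```python
def missing_papers(venue_papers, indexed_papers):
    """Papers published at the venue that the library has not indexed."""
    return set(venue_papers) - set(indexed_papers)


def recall(retrieved, wanted):
    """Fraction of the wanted papers that appear among the retrieved ones."""
    wanted = set(wanted)
    if not wanted:
        return 0.0
    return len(wanted & set(retrieved)) / len(wanted)


def harvest_rate(relevance_flags):
    """Fraction of fetched pages that were judged relevant."""
    flags = list(relevance_flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


# Toy data (hypothetical identifiers, for illustration only).
venue = {"p1", "p2", "p3", "p4", "p5", "p6"}
indexed = {"p1", "p2"}
missing = missing_papers(venue, indexed)      # p3, p4, p5, p6

crawled = {"p1", "p3", "p4", "p5", "other"}   # what the crawler downloaded
missing_recall = recall(crawled, missing)     # 3 of the 4 missing papers found

# One flag per fetched page: True if the page was judged relevant.
rate = harvest_rate([True, False, True, True, False])
```

Under these assumptions, recall over the missing subset can be reported separately from recall over the whole venue, which matches how the abstract quotes two recall figures.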
