Panorama: extending digital libraries with topical crawlers

A large amount of research, technical and professional documents are available today in digital formats. Digital libraries are created to facilitate search and retrieval of information supplied by the documents. These libraries may span an entire area of interest (e.g., computer science) or be limited to documents within a small organization. While tools that index, classify, rank and retrieve documents from such libraries are important, it would be worthwhile to complement these tools with information available on the Web. We propose one such technique that uses a topical crawler driven by the information extracted from a research document. The goal of the crawler is to harvest a collection of Web pages that are focused on the topical subspaces associated with the given document. The collection created through Web crawling is further processed using lexical and linkage analysis. The entire process is automated and uses machine learning techniques to both guide the crawler as well as analyze the collection it fetches. A report is generated at the end that provides visual cues and information to the researcher.

[1]  Eli Upfal,et al.  Stochastic models for the Web graph , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[2]  Tom M. Mitchell,et al.  Learning to Extract Symbolic Knowledge from the World Wide Web , 1998, AAAI/IAAI.

[3]  Martin van den Berg,et al.  Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery , 1999, Comput. Networks.

[4]  Philip S. Yu,et al.  Intelligent crawling on the World Wide Web with arbitrary predicates , 2001, WWW '01.

[5]  Reinier Post,et al.  Information Retrieval in the World-Wide Web: Making Client-Based Searching Feasible , 1994, Comput. Networks ISDN Syst..

[6]  David R. Karger,et al.  Scatter/Gather: a cluster-based approach to browsing large document collections , 1992, SIGIR '92.

[7]  Soumen Chakrabarti,et al.  Accelerated focused crawling through online relevance feedback , 2002, WWW.

[8]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[9]  Ravi Kumar,et al.  Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.

[10]  Gerard Salton,et al.  The SMART Retrieval System—Experiments in Automatic Document Processing , 1971 .

[11]  Oren Etzioni,et al.  Grouper: A Dynamic Clustering Interface to Web Search Results , 1999, Comput. Networks.

[12]  Hector Garcia-Molina,et al.  Efficient Crawling Through URL Ordering , 1998, Comput. Networks.

[13]  C. Lee Giles,et al.  A system for automatic personalized tracking of scientific literature on the Web , 1999, DL '99.

[14]  Krishna Bharat,et al.  Improved algorithms for topic distillation in a hyperlinked environment , 1998, SIGIR '98.

[15]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[16]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[17]  Filippo Menczer,et al.  Evaluating topic-driven web crawlers , 2001, SIGIR '01.

[18]  Tao Jiang,et al.  Linear approximation of shortest superstrings , 1994, JACM.

[19]  Susan T. Dumais,et al.  Bringing order to the Web: automatically categorizing search results , 2000, CHI.

[20]  Marco Gori,et al.  Focused Crawling Using Context Graphs , 2000, VLDB.

[21]  Hsinchun Chen,et al.  Information navigation on the web by clustering and summarizing query results , 2001, Inf. Process. Manag..

[22]  Filippo Menczer,et al.  Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web , 2000, Machine Learning.

[23]  Masaru Kitsuregawa,et al.  Evaluating contents-link coupled web page clustering for web search results , 2002, CIKM '02.

[24]  Donna Bergmark,et al.  Collection synthesis , 2002, JCDL '02.

[25]  C. Lee Giles,et al.  Evolving Strategies for Focused Web Crawling , 2003, ICML.

[26]  Edward A. Fox,et al.  Automatic document metadata extraction using support vector machines , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..

[27]  Carl Lagoze,et al.  Focused Crawls, Tunneling, and Digital Libraries , 2002, ECDL.

[28]  Hsinchun Chen,et al.  Personalized spiders for web search and analysis , 2001, JCDL '01.

[29]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[30]  Ian Witten,et al.  Data Mining , 2000 .

[31]  Alberto H. F. Laender,et al.  The Web-DL environment for building digital libraries from the Web , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..

[32]  George Karypis,et al.  Evaluation of hierarchical clustering algorithms for document datasets , 2002, CIKM '02.