Indexing and retrieval of scientific literature

The web has greatly improved access to scientific literature. However, scientific articles on the web are largely disorganized, with research articles spread across archive sites, institution sites, journal sites, and researcher homepages. No index covers all of the available literature, and the major web search engines typically do not index the content of PostScript/PDF documents at all. This paper discusses the creation of digital libraries of scientific literature on the web, including the efficient location of articles, full-text indexing of the articles, autonomous citation indexing, information extraction, display of query-sensitive summaries and citation context, hubs and authorities computation, similar document detection, user profiling, distributed error correction, graph analysis, and detection of overlapping documents. The software for the system is available at no cost for non-commercial use.
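As an illustration of one technique named above, hubs and authorities computation, the following is a minimal sketch of the standard iterative hubs-and-authorities (HITS) update applied to a citation graph. It is not the system's own implementation: the function name `hits`, the toy graph, and the paper identifiers are hypothetical and chosen only for this example.

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) scores for a citation graph.

    `links` maps each paper id to the list of paper ids it cites.
    All names here are illustrative assumptions, not the system's API.
    """
    # Collect every paper that cites or is cited.
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the papers citing it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        # Hub score: sum of the authority scores of the papers it cites.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}

    return hub, auth


# Toy citation graph (hypothetical data): a survey cites three papers.
citations = {
    "survey": ["paperA", "paperB", "paperC"],
    "paperC": ["paperA", "paperB"],
    "paperB": ["paperA"],
}
hub, auth = hits(citations)
```

In this toy graph the most-cited paper receives the highest authority score, and the survey, which cites every other paper, receives the highest hub score. A production system would more likely use sparse matrix operations and a convergence test instead of a fixed iteration count.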
