CiteSeerχ: a scalable autonomous scientific digital library

CiteSeer is a scientific literature digital library and search engine that automatically crawls and indexes scientific documents in computer and information science. Since its inception in 1997, CiteSeer has grown to index over 730,000 documents and to serve over 800,000 requests daily, pushing the limits of the current system's capabilities. In addition, CiteSeer's monolithic architecture complicates system maintenance and limits flexibility in new feature development, algorithm updates, and interoperability with other systems. In this paper, we discuss the problems of the current CiteSeer architecture and propose a new architecture for a next-generation CiteSeer application, based on modular web services and pluggable service components. Preliminary results from a prototype system show that the new architecture improves CiteSeer's flexibility, scalability, and performance. We also discuss new services in development for the next-generation CiteSeer system.
