Towards Next Generation CiteSeer: A Flexible Architecture for Digital Library Deployment

CiteSeer began as the first search engine for scientific literature to incorporate Autonomous Citation Indexing, and has since grown into a widely used, open archive of computer and information science publications, currently indexing over 730,000 academic documents. However, CiteSeer faces significant challenges that must be overcome to improve the quality of the service and to ensure that it remains a valuable, up-to-date resource. This paper describes a new architectural framework for CiteSeer system deployment, named CiteSeer Plus. The new framework supports distributed indexing and storage for load balancing and fault tolerance, as well as modular service deployment to increase system flexibility and reduce maintenance costs. To facilitate novel approaches to information extraction, a blackboard framework is built into the architecture.
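As a rough illustration of the blackboard idea mentioned above, the following minimal sketch shows independent extractors ("knowledge sources") cooperating through a shared store. It is not CiteSeer Plus code: the names `Blackboard`, `KnowledgeSource`, `control_loop`, and the two toy extractors are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Blackboard:
    """Shared store that knowledge sources read from and write to."""
    entries: Dict[str, str] = field(default_factory=dict)


@dataclass
class KnowledgeSource:
    """A hypothetical extractor: consumes some entries, produces one."""
    name: str
    needs: List[str]
    produces: str
    run: Callable[[Blackboard], str]

    def can_run(self, board: Blackboard) -> bool:
        # Eligible when all inputs are present and the output is not yet posted.
        have = board.entries
        return self.produces not in have and all(k in have for k in self.needs)


def control_loop(board: Blackboard, sources: List[KnowledgeSource]) -> List[str]:
    """Fire eligible sources until none can contribute; return the firing order."""
    fired = []
    progress = True
    while progress:
        progress = False
        for src in sources:
            if src.can_run(board):
                board.entries[src.produces] = src.run(board)
                fired.append(src.name)
                progress = True
    return fired


def extract_header(board: Blackboard) -> str:
    # Toy assumption: the first line of the document text is its header.
    return board.entries["text"].strip().splitlines()[0]


def extract_title(board: Blackboard) -> str:
    # Toy assumption: the title is the header in title case.
    return board.entries["header"].strip().title()


def demo():
    board = Blackboard({"text": "towards next generation citeseer\nAbstract ..."})
    # Registration order is deliberately "wrong"; the board state drives execution.
    sources = [
        KnowledgeSource("title", ["header"], "title", extract_title),
        KnowledgeSource("header", ["text"], "header", extract_header),
    ]
    order = control_loop(board, sources)
    return board, order
```

In this sketch a new extraction method can be added by registering another knowledge source, without changing the existing ones, which is the kind of flexibility a blackboard design is usually chosen for.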
