SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries by Crawling the Web

SeerSuite is a framework for scientific and academic digital libraries and search engines built by crawling scientific and academic documents from the web with a focus on providing reliable, robust services. In addition to full text indexing, SeerSuite supports autonomous citation indexing and automatically links references in research articles to facilitate navigation, analysis and evaluation. SeerSuite enables access to extensive document, citation, and author metadata by automatically extracting, storing and indexing metadata. SeerSuite also supports MyCiteSeer, a personal portal that allows users to monitor documents, store user queries, build document portfolios, and interact with the document metadata. We describe the design of SeerSuite and the deployment and usage of CiteSeerx as an instance of SeerSuite.

[1]  Wang-Chien Lee,et al.  CiteSeerx: an architecture and web service design for an academic document search engine , 2006, WWW '06.

[2]  Kun Bai,et al.  TableSeer: automatic table metadata extraction and searching in digital libraries , 2007, JCDL '07.

[3]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[4]  Herbert Van de Sompel,et al.  Open Archives Initiative - Protocol for Metadata Harvesting - v.2.0 , 2002 .

[5]  Paul Ginsparg,et al.  Can Peer Review Be Better Focused? , 2002 .

[6]  Robert Wilensky,et al.  A framework for distributed digital object services , 2006, International Journal on Digital Libraries.

[7]  C. Lee Giles,et al.  Distributed error correction , 1999, DL '99.

[8]  A. Cawkell Science Citation Index , 1970, Nature.

[9]  Edward A. Fox,et al.  Automatic document metadata extraction using support vector machines , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..

[10]  H. Penny Nii,et al.  Blackboard Systems, Part One: The Blackboard Model of Problem Solving and the Evolution of Blackboard Architectures , 1986, AI Mag..

[11]  Marco Gori,et al.  Towards Next Generation CiteSeer: A Flexible Architecture for Digital Library Deployment , 2006, ECDL.

[12]  E Garfield,et al.  "Science Citation Index"--A New Dimension in Indexing. , 1964, Science.

[13]  Xiangmin Zhang,et al.  Rule-based word clustering for document metadata extraction , 2005, SAC '05.

[14]  C. Lee Giles,et al.  Efficient Name Disambiguation for Large-Scale Databases , 2006, PKDD.

[15]  C. Lee Giles,et al.  Metadata extraction and indexing for map search in web documents , 2008, CIKM '08.

[16]  C. Lee Giles,et al.  ParsCit: an Open-source CRF Reference String Parsing Package , 2008, LREC.

[17]  Hui Han,et al.  CiteSeer-API: towards seamless resource location and interlinking for digital libraries , 2004, CIKM '04.

[18]  Sandip Debnath,et al.  Learning metadata from the evidence in an on-line citation matching scheme , 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[19]  Penny Nii The blackboard model of problem solving , 1986 .

[20]  C. Lee Giles,et al.  CiteSeer: an automatic citation indexing system , 1998, DL '98.

[21]  Hui Han,et al.  eBizSearch: an OAI-compliant digital library for ebusiness , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..

[22]  Roy Fielding,et al.  Architectural Styles and the Design of Network-based Software Architectures"; Doctoral dissertation , 2000 .

[23]  C. Lee Giles,et al.  ChemXSeer: a digital library and data repository for chemical kinetics , 2007, CIMS '07.