Cloud Computing: A Digital Libraries Perspective

Provisioning and maintenance of infrastructure for Web-based digital library search engines such as CiteSeer$^x$ present several challenges. CiteSeer$^x$ provides autonomous citation indexing, full-text indexing, and extensive document metadata from documents crawled from the web across computer and information sciences and related fields. Infrastructure virtualization and cloud computing are particularly attractive choices for CiteSeer$^x$, which is challenged by growth in the size of the indexed document collection, by new features, and, most prominently, by usage. In this paper, we discuss the constraints and choices faced by information retrieval systems like CiteSeer$^x$ by exploring in detail aspects of placing CiteSeer$^x$ into current cloud infrastructure offerings. We also implement an ad hoc virtualized storage system for experimenting with the adoption of cloud infrastructure services. Our results show that a cloud implementation of CiteSeer$^x$ may be a feasible alternative for its continued operation and growth.
