CiteSeerX: a cloud perspective

Information retrieval applications are good candidates for hosting in a cloud infrastructure. CiteSeerX, a digital library and search engine, was built with the goal of efficiently disseminating scientific information and literature over the web. The framework for CiteSeerX, as an application of the SeerSuite software, is designed with extensibility and scalability as fundamental features. This loosely coupled architecture, with service-oriented interfaces, allows all or parts of SeerSuite to be readily placed in the cloud. We briefly discuss the architecture, approaches, and advantages of hosting CiteSeerX in the cloud. We present initial results on the costs of migrating all or parts of CiteSeerX to two popular cloud offerings and discuss the effort involved.
