Digital Preservation in Grids and Clouds: A Middleware Approach

Digital preservation is the persistent archiving of digital assets for future access and reuse, irrespective of the underlying platform and software solutions. Existing preservation systems have a strong focus on Grids, but the advent of cloud technologies offers an attractive alternative. We describe a middleware system that enables a flexible choice between a Grid and a cloud, both for ad-hoc computations that arise during the execution of a preservation workflow and for archiving digital objects. The choice of infrastructure remains open throughout the lifecycle of the archive, so an organization can switch smoothly between solutions as its requirements for preserving its digital assets change. We also offer insights on the costs, running times, and organizational issues of cloud computing, which indicate that the cloud alternative is particularly attractive for smaller organizations without access to a Grid or with limited IT infrastructure.
