Encouraging collaboration through a new data management approach

The ability to store large volumes of data is increasing faster than processing power. Many existing data management methods result in data loss, inaccessible data, or repeated simulations. We propose a framework that promotes collaboration and simplifies data management. In particular, we have demonstrated the proposed framework by handling large-scale data generated from biomolecular simulations in a multi-institutional global collaboration. The framework extends the ability of the Python problem-solving environment to manage the data files and metadata associated with simulations. We provide a transparent and seamless environment in which user-submitted code can analyse and post-process data stored in the framework. Based on this scenario, we have further extended the framework to handle the more generic case of enabling any existing data file to be post-processed from any .NET-enabled programming language.
