Grid Workflow Challenges in Computational Chemistry

Abstract: Traditionally, access to high-end clusters and the development of computational codes optimized for the cluster environment have been the limiting factors for high-performance computing in many scientific domains. However, over the past few years, the commoditization of cluster hardware, the development of cluster management software (e.g., ROCKS, OSCAR), the standardization of application software and its high-performance capabilities (MPI, CONDOR), and the development of service-oriented infrastructures have made access to large-scale computing commonplace in many disciplines. We argue that the challenge in today’s environment is the integration of capabilities across multiple separate independent applications, including access to public datasets, and the creation and publication of explicit workflows. In this paper, we describe this specific challenge as it relates to supporting novel science being conducted in the domain of Computational Chemistry.

I. INTRODUCTION

As Grid Computing technology has matured and its adoption in scientific disciplines has advanced, the technical challenges have evolved from an initial focus on algorithm design for parallel and clustered systems to more social challenges such as data publication and search, metadata specification and standardization, and tool integration. The technology trends that have enabled this change include: wide adoption of easy-to-use and easy-to-manage commodity clusters (Rocks [27], OSCAR [4]); wide availability of HPC libraries (MPI) and job scheduling middleware (Globus [24], Sun Grid Engine [11], Portable Batch System [22]); the development of easily deployed grid security components (GAMA [17], PURSE [6]); and the application of service-oriented architectures (SOAs) and their constituent technologies: WSDL for service description, SOAP for communication, and portals for end-user environments (GridSphere [8], OGCE [3], Jetspeed [1], GEMSTONE [7]).
These technologies, taken together, allow grid computing infrastructure for any domain to be built rapidly and cost-effectively.

Computational Science, in the meantime, is also evolving from isolated city-states [29] of research towards a multi-disciplinary, integrated view that spans different scientific scales and integrates multiple toolsets. In [20], Foster describes an evolution towards a service-oriented science: “scientific research enabled by distributed networks of interoperating services”. For example, large-scale biological endeavors such as the NIH-funded National Biomedical Computation Resource (NBCR) [9] describe this as multi-scale modeling and list it as one of their core challenges.

With the current state of technology, a user who wishes to leverage multiple applications in a particular scientific effort must invest significant effort in training for each application, and must develop his or her own mechanisms for interoperability across the applications. As the number of applications increases, this effort can easily dominate the day-to-day activities of the user. Some of the training effort is semantic in nature; that is, the proper use of scientific applications requires a good understanding of the underlying science associated with the applications. However, we argue that many of the bottlenecks are arbitrary in nature, relating to command-line variations, input file formats, data formats, poor or missing documentation, or simply historical accident. These are often not science-based, but related to the particular implementation or runtime environment of the applications. We term these syntactic bottlenecks and focus our infrastructure efforts here.

Service-oriented architectures (SOAs), workflow systems, and component architectures are all technologies being developed, in part, to support the integration of separate applications. However, important differences exist.
SOAs provide programmatic access to applications and can work in concert with workflow systems to provide a strongly typed dataflow environment. Many workflow systems support SOAs, including Kepler [12] and Taverna [19], while others support grid applications running within Globus or Condor [13], [18]. Component architectures such as the Common Component Architecture [14] integrate applications in a much tighter way, but require significant code changes and integration effort.

Our focus in this paper is on understanding the requirements of a workflow system leveraging an underlying SOA for applications. The SOA is being built as part of several combined efforts, including the scientific efforts of the UniZH group, the NSF-NMI team effort, and the infrastructure efforts of the National Biomedical Computation Resource (NBCR). The first part of this paper focuses on the particular workflow challenges in the domain of Computational Chemistry, illustrating as a case study the tools and procedures needed to leverage community datasets, utility programs and both quantum mechanical and

SDSC TR-2006-7

[1] T. N. Bhat, et al. The Protein Data Bank, 2000, Nucleic Acids Res.

[2] Michael J. Holst, et al. Numerical solution of the nonlinear Poisson–Boltzmann equation: Developing more robust and efficient methods, 1995, J. Comput. Chem.

[3] Bertram Ludäscher, et al. Kepler: an extensible system for design and execution of scientific workflows, 2004.

[4] Kaizar Amin, et al. GridAnt: A Grid Workflow System, 2003.

[5] Ian Foster, et al. The Globus toolkit, 1998.

[6] Matthew R. Pocock, et al. Taverna: a tool for the composition and enactment of bioinformatics workflows, 2004, Bioinform.

[7] Robert L. Henderson, et al. Job Scheduling Under the Portable Batch System, 1995, JSSPP.

[8] Haruki Nakamura, et al. PDBML: the representation of archival macromolecular structure data in XML, 2005, Bioinform.

[9] I. Foster, et al. Service-Oriented Science, 2005, Science.

[10] Philip M. Papadopoulos, et al. NPACI Rocks: tools and techniques for easily deploying manageable Linux clusters, 2003, Concurr. Comput. Pract. Exp.

[11] Ian J. Taylor, et al. WSPeer - an interface to Web service hosting and invocation, 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[12] Scott R. Kohn, et al. Toward a Common Component Architecture for High-Performance Scientific Computing, 1999, HPDC.

[13] Yolanda Gil, et al. Pegasus: Mapping Scientific Workflows onto the Grid, 2004, European Across Grids Conference.

[14] Sandeep Chandra, et al. GAMA: grid account management architecture, 2005, First International Conference on e-Science and Grid Computing (e-Science'05).

[15] L. Stein. Creating a bioinformatics nation, 2002, Nature.

[16] Henry S. Rzepa, et al. Chemical Markup, XML, and the World Wide Web. 4. CML Schema, 2003, J. Chem. Inf. Comput. Sci.