Pegasus: Mapping Large-Scale Workflows to Distributed Resources

Many scientific advances today are derived from analyzing large amounts of data, and the computations themselves can be complex and resource-intensive. Scientific efforts are rarely conducted by individual scientists; rather, they rely on collaborations encompassing many researchers from various organizations. An analysis is often composed of several individual application components designed by different scientists. To describe the desired analysis, the components are assembled into a workflow that defines the dependencies between them and identifies the data needed for the analysis. Supporting applications at this scale requires many resources to deliver adequate performance, and these resources are often drawn from a heterogeneous pool of geographically distributed compute and data resources. Running large-scale, collaborative applications in such environments poses many challenges, among them the systematic management of the applications, their components, and their data, as well as successful and efficient execution on the distributed resources.
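Such a workflow is naturally modeled as a directed acyclic graph (DAG), with application components as nodes and data dependencies as edges. The following is a minimal illustrative sketch of this idea using Python's standard library; the task names and structure are hypothetical and do not represent any particular Pegasus workflow or API.

```python
# Hypothetical sketch: a workflow expressed as a DAG of components.
# Task names here are illustrative assumptions, not Pegasus constructs.
from graphlib import TopologicalSorter

# Each component maps to the set of components whose outputs it consumes.
workflow = {
    "extract": set(),                       # no dependencies; reads raw input data
    "transform_a": {"extract"},             # consumes extract's output
    "transform_b": {"extract"},
    "combine": {"transform_a", "transform_b"},
    "publish": {"combine"},
}

# A topological order is one valid execution sequence that respects
# every dependency; independent tasks could also run in parallel.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

A workflow mapping system takes a description like this and decides where and in what order each component runs on the available distributed resources.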