Dataflow Processing and Optimization on Grid and Cloud Infrastructures

Complex on-demand data retrieval and processing is characteristic of several applications and combines querying & search, information filtering & retrieval, data transformation & analysis, and other data manipulations. Such rich tasks are typically represented as data processing graphs, with arbitrary data operators as nodes and their producer-consumer interactions as edges. Optimizing and executing such graphs on distributed architectures is critical to the success of the corresponding applications and presents several algorithmic and systemic challenges. This paper describes a system under development that offers this functionality on top of ad-hoc clusters, Grids, or Clouds. Operators may be user-defined, so their algebraic and other properties, as well as those of the data they produce, are specified in associated profiles. Optimization is based on these profiles, must satisfy a variety of objectives and constraints, and takes into account the particular characteristics of the underlying architecture, mapping high-level dataflow semantics to flexible runtime structures. The paper highlights the key components of the system and outlines the major directions of its development.
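As a rough illustration of the graph model described above, the following minimal Python sketch represents operators as nodes, producer-consumer interactions as edges, and attaches a profile to each operator. It is not the system's actual design: the names `Operator`, `Dataflow`, `profile`, and the profile keys are hypothetical, and execution here is a simple single-process traversal in dependency order rather than an optimized distributed plan.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Sequence


@dataclass
class Operator:
    """A node of the dataflow graph (hypothetical structure)."""
    name: str
    fn: Callable[..., list]
    # Hypothetical profile: properties an optimizer could consult,
    # e.g. estimated cost or algebraic properties of the operator.
    profile: Dict[str, object] = field(default_factory=dict)


class Dataflow:
    """Operators plus producer-consumer edges (consumer -> producers)."""

    def __init__(self) -> None:
        self.ops: Dict[str, Operator] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add(self, op: Operator, inputs: Sequence[str] = ()) -> None:
        self.ops[op.name] = op
        self.inputs[op.name] = list(inputs)

    def run(self) -> Dict[str, list]:
        """Run every operator after all of its producers have run."""
        results: Dict[str, list] = {}
        # TopologicalSorter takes a mapping node -> predecessors and
        # yields predecessors before the nodes that depend on them.
        for name in TopologicalSorter(self.inputs).static_order():
            args = [results[p] for p in self.inputs[name]]
            results[name] = self.ops[name].fn(*args)
        return results


flow = Dataflow()
flow.add(Operator("source", lambda: [3, 1, 2, 3], {"est_cost": 1.0}))
flow.add(Operator("dedup", lambda xs: sorted(set(xs)), {"idempotent": True}),
         inputs=["source"])
flow.add(Operator("square", lambda xs: [x * x for x in xs], {"stateless": True}),
         inputs=["dedup"])

results = flow.run()
print(results["square"])
```

An optimizer in this setting would inspect the `profile` entries (here only placeholders) to decide, for example, whether operators may be reordered or where each should be placed; the sketch leaves that step out.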
