Cost Optimization of Data Flows Based on Task Re-ordering

Analyzing big data in highly dynamic environments is becoming more and more critical because of the increasing need for end-to-end processing of this data. Modern data flows are quite complex, and there are no efficient, cost-based, fully automated, scalable optimization solutions to assist flow designers. State-of-the-art proposals fail to provide near-optimal solutions even for simple data flows. To tackle this problem, we introduce a set of approximate algorithms that define the execution order of a flow's constituent tasks so as to minimize the total execution cost of the data flow. We also present the advantages of executing data flows in parallel. We validate our proposals both in a real tool and on synthetic flows, and the results show that we can achieve significant speed-ups, moving much closer to optimal solutions.
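To make the idea of cost-based task re-ordering concrete, the following is a minimal sketch and not the algorithms proposed in this work. It assumes a cost model that the abstract does not spell out: each task has a per-record cost and a selectivity (output-to-input ratio), and the total cost of a linear execution plan is the sum of each task's cost weighted by the data volume that reaches it. The function names, the greedy rank heuristic, and the example numbers are all hypothetical illustrations.

```python
def flow_cost(order, cost, sel):
    """Total cost of running tasks in the given order.

    Assumed model: each task pays cost[t] per unit of incoming data,
    and scales the data volume by its selectivity sel[t].
    """
    total, volume = 0.0, 1.0
    for t in order:
        total += volume * cost[t]
        volume *= sel[t]
    return total


def greedy_order(cost, sel, preds):
    """Greedy heuristic: repeatedly pick the eligible task with the
    highest rank (1 - selectivity) / cost.

    preds maps a task to the set of tasks that must run before it.
    This is a simple approximation and is not guaranteed to be optimal
    when precedence constraints are present.
    """
    done, order = set(), []
    remaining = set(cost)
    while remaining:
        # Tasks whose prerequisites have all been scheduled already.
        eligible = sorted(t for t in remaining if preds.get(t, set()) <= done)
        if not eligible:
            raise ValueError("precedence constraints contain a cycle")
        best = max(eligible, key=lambda t: (1 - sel[t]) / cost[t])
        order.append(best)
        done.add(best)
        remaining.remove(best)
    return order


# Hypothetical four-task flow: A must precede B and C; B must precede D.
cost = {"A": 1.0, "B": 4.0, "C": 2.0, "D": 1.0}
sel = {"A": 1.0, "B": 0.5, "C": 0.1, "D": 0.9}
preds = {"B": {"A"}, "C": {"A"}, "D": {"B"}}

initial = ["A", "B", "C", "D"]
reordered = greedy_order(cost, sel, preds)
print(initial, flow_cost(initial, cost, sel))
print(reordered, flow_cost(reordered, cost, sel))
```

In this toy flow, moving the highly selective task C ahead of the expensive task B reduces the data volume that B has to process, which is the kind of saving that re-ordering aims for.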
