Time-bound analytic tasks on large datasets through dynamic configuration of workflows

Domain experts are often untrained in big data technologies, which limits their ability to exploit the data they have available. Workflow systems hide the complexities of high-end computing and software engineering by offering pre-packaged analytic steps that can be combined into the multi-step methods commonly used by experts. A current limitation of workflow systems is that they do not take user deadlines into account: they run the workflows selected by the user, but with no bound on how long execution takes. This is impractical when large datasets are at stake, since users often prefer to see an answer sooner even if it has lower precision or quality. In this paper, we present an extension to workflow systems that enables them to take user deadlines into account by automatically generating alternative workflow candidates and ranking them according to performance estimates. The system derives these estimates from workflow performance models built from previous workflow executions, and uses semantic technologies to reason about workflow options. Candidate workflows are presented to the user in a compact manner, ranked by their estimated runtime. We have implemented this approach in the WOOT system, which combines and extends capabilities from the WINGS semantic workflow system and Apache OODT (Object Oriented Data Technology) and its workflow execution system.
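The deadline-aware ranking described above can be illustrated with a minimal sketch. It is not the WOOT implementation: it assumes that each component's runtime grows linearly with input size, that a workflow's runtime is the sum of its steps, and that past executions are available as (input size, runtime) pairs. The component names, sample timings, and helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    steps: list  # hypothetical component names, assumed to run in sequence


def fit_linear(samples):
    """Least-squares fit of runtime = a + b * size from (size, runtime) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    b = sxy / sxx if sxx else 0.0
    a = mean_y - b * mean_x
    return a, b


def estimate_runtime(candidate, models, size):
    """Sum the per-component estimates (assumes sequential execution)."""
    return sum(a + b * size for a, b in (models[s] for s in candidate.steps))


def rank(candidates, models, size, deadline):
    """Return (name, estimated runtime, meets deadline) sorted by runtime."""
    scored = sorted((estimate_runtime(c, models, size), c.name) for c in candidates)
    return [(name, t, t <= deadline) for t, name in scored]


# Hypothetical execution history: (input size, runtime in seconds).
history = {
    "clean": [(100, 1.0), (200, 2.0)],
    "fast_model": [(100, 5.0), (200, 10.0)],
    "precise_model": [(100, 20.0), (200, 40.0)],
}
models = {name: fit_linear(samples) for name, samples in history.items()}

candidates = [
    Candidate("precise", ["clean", "precise_model"]),
    Candidate("fast", ["clean", "fast_model"]),
]

ranked = rank(candidates, models, size=1000, deadline=100.0)
for name, seconds, ok in ranked:
    print(f"{name}: {seconds:.1f}s, meets deadline: {ok}")
```

In this sketch the faster, lower-quality candidate is listed first and flagged as meeting the deadline, while the slower candidate remains visible so the user can still choose it. A real system would likely need richer performance models and would account for parallel execution.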
