Data-intensive architecture for scientific knowledge discovery

This paper presents a data-intensive architecture that has been shown to support applications from a wide range of domains, as well as the different types of users involved in defining, designing and executing data-intensive processing tasks. The prototype architecture is introduced, and the pivotal role of DISPEL as a canonical language is explained. The architecture promotes the exploration and exploitation of distributed and heterogeneous data and spans the complete knowledge discovery process, from data preparation, through analysis, to evaluation and reiteration. The architecture was evaluated with large-scale applications from astronomy, cosmology, hydrology, functional genetics, image processing and seismology.
