Methods and Experiences for Developing Abstractions for Data-intensive, Scientific Applications

Developing software for scientific applications that require the integration of diverse types of computing, instruments, and data presents challenges that are distinct from those of commercial software. These applications must operate at scale and integrate various programming and computational models with evolving, heterogeneous infrastructure. Pervasive and effective abstractions for distributed infrastructures are thus critical; however, the process of developing abstractions for scientific applications and infrastructures is not well understood. While theory-based approaches to system development are suited to well-defined, closed environments, they have severe limitations for designing abstractions for scientific systems and applications. The design science research (DSR) method provides the basis for designing practical systems that can handle real-world complexities at all levels. In contrast to theory-centric approaches, DSR emphasizes both practical relevance and knowledge creation by building and rigorously evaluating all artifacts. We show how DSR provides a well-defined framework for developing abstractions and middleware systems for distributed systems. Specifically, we address the critical problem of distributed resource management on heterogeneous infrastructure over a dynamic range of scales, a challenge that currently limits many scientific applications. We use the pilot-abstraction, a widely used resource management abstraction for high-performance, high-throughput, big data, and streaming applications, as a case study for evaluating the DSR activities. For this purpose, we analyze the research process and artifacts produced during the design and evaluation of the pilot-abstraction. We find that DSR provides a concise framework for iteratively designing and evaluating systems. Finally, we capture our experiences and formulate lessons learned.
