HPChain: An MPI-Based Blockchain Framework for Data Fidelity in High-Performance Computing Systems

Abdullah Al-Mamun, Tonglin Li, Mohammad Sadoghi, Linhua Jiang, Haoting Shen, and Dongfang Zhao

ABSTRACT

Data fidelity is of paramount importance for scientific experiments and simulations: the data upon which scientific discovery rests must be trustworthy and retain its veracity at every point in the scientific workflow. The state-of-the-art mechanism for ensuring data fidelity is data provenance, which keeps track of data changes and allows scientific discoveries to be audited and reproduced. However, the provenance data itself may suffer from unintentional human errors and malicious manipulation. To enable a trustworthy and reliable data fidelity service, we advocate achieving the immutability and decentralization of scientific data provenance through blockchains. The challenges of leveraging blockchains in high-performance computing (HPC) are twofold. First, HPC infrastructure differs fundamentally from the platforms targeted by existing blockchain systems. Second, MPI, HPC's de facto programming model, cannot by itself meet the reliability requirements expected by blockchains. To this end, we propose HPChain, a new blockchain framework specifically designed for HPC systems. HPChain employs a new consensus protocol that is compatible with, and optimized for, HPC systems. Furthermore, HPChain is implemented with MPI and integrated with an off-chain distributed provenance service to tolerate failures caused by faulty MPI ranks. The HPChain prototype has been deployed on 500 cores at the University of Nevada's HPC center, where it demonstrated strong resilience and scalability while outperforming state-of-the-art blockchains by orders of magnitude; we are working on deploying HPChain on the Cori supercomputer hosted at the Lawrence Berkeley National Laboratory.

1 MOTIVATION

Data fidelity is of paramount importance for scientific experiments and simulations, as the data upon which scientific discovery rests must be trustworthy and retain its veracity at every point in the scientific workflow. Scientific data might be intentionally fabricated or falsified, invalidated by system failures, or accidentally modified through human error. Regardless of the root cause, the resulting data is untrustworthy and leads to inaccurate or incorrect scientific conclusions. As a case in point, the National Cancer Institute found that 0.25% of trial data were fraudulent in 2015 [4].
In the earth sciences, researchers have emphasized the importance of maintaining data provenance for achieving transparency of scientific discoveries [18]. The de facto way to audit and reproduce scientific research and data is through data provenance, which tracks the entire lifespan of the data across the phases of an experiment or simulation, such as data creation, data modification, and data archival. Conventional provenance systems fall into two categories: centralized and distributed. One representative centralized provenance system is SPADE [7], where provenance from various data sources is collected and managed by a centralized relational database. Domain-specific systems based on this centralized design are also available in biomedical engineering [2] and computational chemistry [13], to name a few. Although reasonably well adopted across disciplines, centralized provenance systems are increasingly criticized by researchers and scientists who face data growing exponentially in both velocity and volume, the so-called "Big Data." In essence, a centralized system, due to the performance bottleneck on the central node (not to mention its potential single point of failure), cannot meet the performance expectations of many data-intensive scientific applications. To this end, we have witnessed a boom of distributed approaches toward scalable provenance [3, 22]. Indeed, these distributed provenance systems, mostly built upon distributed file systems rather than centralized databases, eliminate the performance bottleneck and deliver orders of magnitude higher performance than centralized approaches.

As a double-edged sword, however, distributed provenance systems raise a new concern about the provenance itself: the chance that the provenance is tampered with increases from f% to roughly n · f%, where f denotes the failure rate of a single node and n the total number of nodes (the probability that at least one of n independent nodes is compromised is 1 − (1 − f%)^n, which is approximately n · f% when f is small). Moreover, a natural question arises: if the provenance is supposed to audit the execution of the application, who audits the provenance? Do we need to build the provenance of the provenance? The recursion would continue indefinitely. To this end, decentralized provenance systems inspired by blockchains were recently proposed. These systems (e.g., ProvChain [10], SmartProvenance [15]), also called blockchain-based provenance systems, are both tamper-evident and autonomous, and thus guarantee trustworthy data provenance.

Multiple issues arise when applying a blockchain-based provenance system to HPC. First, its space efficiency is low, its network traffic consumption is high, and consistency remains a challenging problem. Moreover, all of these blockchain-based provenance systems treat the underlying blockchain infrastructure as a black box: the provenance service works as a higher-level application calling the programming interfaces exposed by blockchain infrastructures such as Hyperledger Fabric [8] and Ethereum [6]. In the best case, the provenance service misses optimization and customization opportunities, because modifying the lower blockchain layer is impossible or prohibitively expensive and complicated; worse still, the applicability of these blockchain-based provenance systems is constrained by the underlying blockchain infrastructure.
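Returning to the tamper-evidence property mentioned above, the following minimal sketch makes it concrete: each provenance record stores the digest of its predecessor, so silently editing any historical record invalidates every digest that follows. This is our own illustration of the general hash-chaining technique, not code from ProvChain, SmartProvenance, or HPChain, and it uses std::hash only for brevity; a real system would use a cryptographic hash such as SHA-256.

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// One provenance record, chained to its predecessor by digest.
// Illustration only: std::hash is not cryptographic; a real chain
// would use SHA-256 or a similar collision-resistant hash.
struct ProvRecord {
    std::string event;   // e.g., "create", "modify", "archive"
    std::size_t prev;    // digest of the previous record
    std::size_t digest;  // digest over (event, prev)
};

std::size_t digestOf(const std::string& event, std::size_t prev) {
    return std::hash<std::string>{}(event + "|" + std::to_string(prev));
}

void append(std::vector<ProvRecord>& chain, const std::string& event) {
    const std::size_t prev = chain.empty() ? 0 : chain.back().digest;
    chain.push_back({event, prev, digestOf(event, prev)});
}

// Recompute every digest; any in-place edit breaks the chain.
bool verify(const std::vector<ProvRecord>& chain) {
    std::size_t prev = 0;
    for (const auto& r : chain) {
        if (r.prev != prev || r.digest != digestOf(r.event, prev))
            return false;
        prev = r.digest;
    }
    return true;
}

int main() {
    std::vector<ProvRecord> chain;
    append(chain, "create dataset A");
    append(chain, "modify dataset A");
    std::cout << "valid: " << verify(chain) << "\n";  // prints: valid: 1
    chain[0].event = "forged event";                  // tamper with history
    std::cout << "valid: " << verify(chain) << "\n";  // prints: valid: 0
    return 0;
}

Tamper-evidence alone, however, does not decide which chain the participants agree on; that is the job of the consensus protocol, which is where blockchain requirements collide with HPC infrastructure.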

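The abstract above notes that HPChain implements its consensus protocol directly on MPI. As a rough, hypothetical illustration of why MPI is a plausible substrate for consensus messaging, the sketch below expresses one propose-and-vote round with two collectives; this is our assumption of a simple majority-vote round, not HPChain's published protocol.

#include <mpi.h>
#include <cstring>
#include <iostream>

// Hypothetical sketch, not HPChain's actual consensus protocol (which
// this summary only names): one propose-and-vote round expressed with
// MPI collectives. Rank 0 proposes a block; every rank validates and
// votes; the block commits on a strict majority.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char block[64] = {0};
    if (rank == 0)  // the proposer assembles a block of provenance records
        std::strcpy(block, "block#1:provenance-batch");
    MPI_Bcast(block, 64, MPI_CHAR, 0, MPI_COMM_WORLD);  // disseminate

    // Placeholder validation; a real validator would check hashes,
    // signatures, and the block's link to the current chain head.
    int myVote = (std::strlen(block) > 0) ? 1 : 0;
    int yesVotes = 0;
    MPI_Allreduce(&myVote, &yesVotes, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::cout << (2 * yesVotes > size ? "committed " : "rejected ")
                  << block << std::endl;

    MPI_Finalize();
    return 0;
}

The point is only that block dissemination and vote tallying map naturally onto MPI_Bcast and MPI_Allreduce; tolerating faulty or malicious ranks, which plain collectives assume away, is precisely why HPChain pairs its protocol with an off-chain distributed provenance service.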
REFERENCES

[1] Mohammad Sadoghi et al. In-memory Blockchain: Toward Efficient and Trustworthy Data Provenance for HPC Systems. In 2018 IEEE International Conference on Big Data (Big Data), 2018.
[2] Carole A. Goble et al. Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications. Journal of Biomedical Semantics, 2013.
[3] Murat Kantarcioglu et al. SmartProvenance: A Distributed, Blockchain Based Data Provenance System. In CODASPY, 2018.
[4] Beng Chin Ooi et al. BLOCKBENCH: A Framework for Analyzing Private Blockchains. In SIGMOD Conference, 2017.
[5] Dongfang Zhao et al. Toward Accurate and Efficient Emulation of Public Blockchains in the Cloud. In CLOUD, 2019.
[6] Chen Shou et al. Distributed data provenance for large-scale data-intensive computing. In 2013 IEEE International Conference on Cluster Computing (CLUSTER), 2013.
[7] Sachin Shetty et al. ProvChain: A Blockchain-Based Data Provenance Architecture in Cloud Environment with Enhanced Privacy and Availability. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2017.
[8] Conrad C. Huang et al. UCSF Chimera—A visualization system for exploratory research and analysis. Journal of Computational Chemistry, 2004.
[9] Ke Wang et al. ZHT: A Light-Weight Reliable Persistent Dynamic Scalable Zero-Hop Distributed Hash Table. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013.
[10] Robert B. Ross et al. Lightweight Provenance Service for High-Performance Computing. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017.
[11] Ashish Gehani et al. SPADE: Support for Provenance Auditing in Distributed Environments. In Middleware, 2012.
[12] Robert B. Ross et al. Towards Exploring Data-Intensive Scientific Applications at Extreme Scales through Systems and Simulations. IEEE Transactions on Parallel and Distributed Systems, 2016.
[13] Alvin Cheung et al. Comparative Evaluation of Big-Data Systems on Scientific Image Analytics Workloads. Proceedings of the VLDB Endowment, 2016.