A Bloom Filter Based Scalable Data Integrity Check Tool for Large-Scale Dataset

Large-scale HPC applications are becoming increasingly data intensive. At the Oak Ridge Leadership Computing Facility (OLCF), we observe that the number of files curated under an individual project can reach as high as 200 million, with project data sizes exceeding petabytes. These simulation datasets, once validated, often need to be transferred to an archival system for long-term storage or shared with the rest of the research community. Ensuring the integrity of the full dataset at this scale is of paramount importance, but it is also a daunting task. This is especially true considering that most conventional tools are serial and file-based, unwieldy to use, and/or unable to scale to meet users' demands. To tackle this challenge, this paper presents the design, implementation, and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principles of parallel tree walk and work stealing to maximize parallelism, and it is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also apply a novel Bloom-filter-based technique for aggregating signatures, which removes the requirement that signatures be combined in a fixed order. Given the probabilistic nature of Bloom filters, we provide a detailed error and trade-off analysis. Using multiple datasets from a production environment, we demonstrate that our tool can efficiently handle both very large files and datasets composed of many small files. Our preliminary tests show that, on the same hardware, it outperforms conventional tools by as much as 4×. It also exhibits near-linear scaling when provisioned with more compute resources.
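To make the order-independence idea concrete, the following is a minimal Python sketch of Bloom-filter-based signature aggregation. It is an illustration under stated assumptions, not fsum's actual implementation: the class `BloomFilter`, the functions `chunk_signature`, `dataset_signature`, and `false_positive_rate`, the use of SHA-256, the double-hashing scheme, and all parameter values are hypothetical choices made here. Each simulated worker inserts its chunk signatures into a private filter, and the filters are merged with a bitwise OR, which is commutative and associative, so the final dataset signature does not depend on the order in which chunks were processed.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter; parameters are illustrative assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def merge(self, other):
        # Bitwise OR is commutative and associative, so merge order is irrelevant.
        for i, byte in enumerate(other.bits):
            self.bits[i] |= byte

    def signature(self):
        # Hash the bit array to obtain one compact dataset-level signature.
        return hashlib.sha256(bytes(self.bits)).hexdigest()


def chunk_signature(path, offset, data):
    """Signature of one chunk, bound to its file path and byte offset."""
    header = f"{path}:{offset}:".encode()
    return hashlib.sha256(header + data).digest()


def dataset_signature(chunks, num_workers=4, num_bits=1 << 16, num_hashes=4):
    """Simulate workers that each fill a private filter, then merge them."""
    workers = [BloomFilter(num_bits, num_hashes) for _ in range(num_workers)]
    for i, (path, offset, data) in enumerate(chunks):
        workers[i % num_workers].add(chunk_signature(path, offset, data))
    merged = BloomFilter(num_bits, num_hashes)
    for worker in workers:
        merged.merge(worker)
    return merged.signature()


def false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation (1 - e^(-k*n/m))^k for a Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


chunks = [
    ("a.dat", 0, b"alpha"),
    ("a.dat", 5, b"beta"),
    ("b.dat", 0, b"gamma"),
    ("c.dat", 0, b"delta"),
]
sig_forward = dataset_signature(chunks)
sig_reversed = dataset_signature(list(reversed(chunks)))
```

In this sketch, `sig_forward` and `sig_reversed` are identical even though the chunks reach different workers in a different order. The trade-off is the one the abstract points to: a corrupted chunk goes undetected only if its signature maps entirely onto bits already set by other chunks, and the chance of that grows as the filter fills, which `false_positive_rate` approximates with the standard Bloom filter formula.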
