SD-HDFS: Secure Deletion in Hadoop Distributed File System

Sensitive information stored in Hadoop clusters can potentially be retrieved without authorization, and the ability to recover deleted data from Hadoop clusters represents a major security threat. Because Hadoop clusters are used to manage large amounts of data both within and outside organizations, it has become important to locate and remove data effectively and efficiently. In this paper, we propose Secure Delete, a holistic framework that propagates file information to the block management layer via an auxiliary communication path. The framework tracks undeleted data blocks and modifies the normal deletion operation of the Hadoop Distributed File System (HDFS). We introduce the CheckerNode, which generates a summary report from all DataNodes and compares the reported block information with the metadata held by the NameNode; data blocks that have no corresponding metadata entries (unsynchronized blocks) are automatically deleted. However, deleted data could still be recovered using digital forensics tools, so we also describe a novel secure deletion technique for HDFS that overwrites the disk location of a data block multiple times with randomly generated patterns.
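
As a rough illustration of the CheckerNode's reconciliation step, the following sketch (in Java, Hadoop's implementation language) compares the block IDs aggregated from the DataNode summary reports against those recorded in the NameNode metadata. The class and method names here are hypothetical illustrations and are not taken from the paper or from Hadoop's actual API.

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch of the CheckerNode reconciliation step.
    public class CheckerNode {

        // Returns the IDs of blocks reported by DataNodes that have no
        // corresponding entry in the NameNode metadata; in the framework,
        // such unsynchronized blocks are scheduled for deletion.
        public static Set<Long> findUnsynchronizedBlocks(
                Set<Long> dataNodeBlockIds, Set<Long> nameNodeBlockIds) {
            Set<Long> unsynchronized = new HashSet<>(dataNodeBlockIds);
            unsynchronized.removeAll(nameNodeBlockIds);
            return unsynchronized;
        }

        public static void main(String[] args) {
            Set<Long> reported = new HashSet<>(Set.of(101L, 102L, 103L));
            Set<Long> metadata = new HashSet<>(Set.of(101L, 103L));
            // Block 102 has no metadata entry, so it would be deleted.
            System.out.println(findUnsynchronizedBlocks(reported, metadata));
        }
    }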

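A minimal sketch of the multi-pass random-pattern overwrite is shown below, again in Java. The pass count and buffer size are illustrative assumptions rather than values prescribed by the paper; opening the block file in "rws" mode is one way to force each pass to be written through to the storage device instead of lingering in the page cache.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.security.SecureRandom;

    // Hypothetical sketch: overwrite a block file with random patterns.
    public class SecureOverwrite {

        private static final int PASSES = 3;         // assumed pass count
        private static final int BUFFER_SIZE = 4096; // assumed buffer size

        // Overwrites every byte of the block file PASSES times with
        // freshly generated random data.
        public static void overwrite(String blockFilePath) throws IOException {
            SecureRandom random = new SecureRandom();
            byte[] buffer = new byte[BUFFER_SIZE];
            // "rws" writes content and metadata synchronously to the device.
            try (RandomAccessFile file = new RandomAccessFile(blockFilePath, "rws")) {
                long length = file.length();
                for (int pass = 0; pass < PASSES; pass++) {
                    file.seek(0);
                    long written = 0;
                    while (written < length) {
                        random.nextBytes(buffer);
                        int chunk = (int) Math.min(buffer.length, length - written);
                        file.write(buffer, 0, chunk);
                        written += chunk;
                    }
                }
            }
        }
    }

A standard caveat for overwrite-based deletion applies: on SSDs and copy-on-write file systems, an in-place overwrite may not reach the original physical location, so the technique is strongest on conventional magnetic disks.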