Effectiveness of file-based deduplication in digital forensics

Over the last decades, the growing amount of storage has become a pressing problem for forensic investigators. This growth is driven by the computerization of everyday life and the associated increase in the number of devices in typical households. With multi-terabyte storage on the suspects' side, the investigator needs even more storage for secure backups and working copies. In this paper, we improve the standardized forensic process by proposing the rigorous use of file deduplication across devices, together with file whitelisting, to reduce the amount of data that must be stored for analysis, starting as early as data acquisition. These improvements are applied automatically and are completely transparent to the forensic investigator. Furthermore, they can be added without negative effects on the chain of custody or on artifact validity in court, and they are evaluated in a realistic use case. Additionally, we illustrate the effectiveness of our proposed approach on a real-world corpus by showing a notable reduction in the number of files as well as in the storage required. Copyright © 2016 John Wiley & Sons, Ltd.
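The general idea of combining cross-device deduplication with whitelisting can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of SHA-256, the in-memory data layout, and the names `sha256_hex` and `deduplicate` are all assumptions made purely for illustration. Each file is hashed at acquisition time; content whose hash is on a whitelist of known files is not stored, content already seen on any device is stored only once, and every file is still recorded in a manifest so that its origin remains traceable.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(devices, whitelist):
    """Hypothetical helper: store each distinct, non-whitelisted content once.

    devices   -- dict mapping a device name to a list of (path, content) pairs
    whitelist -- set of hex digests of known files that need not be stored
    """
    store = {}     # digest -> content, kept once across all devices
    manifest = []  # (device, path, digest, status) for every file seen
    for device, files in devices.items():
        for path, content in files:
            digest = sha256_hex(content)
            if digest in whitelist:
                status = "whitelisted"   # known file, content not stored
            elif digest in store:
                status = "duplicate"     # content already stored earlier
            else:
                store[digest] = content
                status = "stored"
            manifest.append((device, path, digest, status))
    return store, manifest


# Illustrative data only; not taken from the paper's corpus.
devices = {
    "laptop": [
        ("/bin/ls", b"known system binary"),
        ("/home/a/report.doc", b"quarterly report"),
    ],
    "phone": [
        ("/sdcard/report.doc", b"quarterly report"),
        ("/sdcard/photo.jpg", b"holiday photo"),
    ],
}
whitelist = {sha256_hex(b"known system binary")}
store, manifest = deduplicate(devices, whitelist)
```

In this toy run, four files are recorded in the manifest but only two distinct contents are stored: the whitelisted binary is skipped and the report found on both devices is kept once. Recording the digest and original path for every file, including skipped ones, is what would allow each artifact to be traced back to its source device.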
