Using Perceptual Hash Algorithms to Identify Fragmented and Transformed Video Files

Over the last few decades, the amount of generated video content has increased exponentially. Easy access to video recording equipment and the Internet has given anyone the ability to create and share video material with the world almost instantaneously. With this enormous amount of content available, the problem of managing it becomes relevant. In situations such as copyright control, media management, or digital forensics, there is a need to perform automatic video search. In this master's thesis we investigate this problem. Using perceptual hash algorithms, we create PYVIDID, a Python-based video identification system able to match query videos against a large database. PYVIDID can also match both fragmented and transformed video files back to their original sources. We also discuss possible application areas for a content-based video identification system. Overall, our results clearly show that perceptual hash algorithms can indeed be used for video identification with high accuracy. We achieve good results regarding both accuracy and speed for original, fragmented, and transformed video files.
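To make the general idea concrete, the following is a minimal sketch of perceptual-hash matching for video fragments. It is not PYVIDID's implementation: the abstract does not say which hash algorithm is used, so the sketch assumes a simple average hash, and the names `average_hash`, `hamming_distance`, and `best_match`, the 8x8 hash size, and the distance threshold are all hypothetical. Frames are assumed to be already decoded into 2D lists of grayscale values.

```python
def average_hash(frame, hash_size=8):
    """Average hash of one grayscale frame (2D list of numbers).

    The frame is reduced to hash_size x hash_size cells by block averaging;
    each cell brighter than the mean of all cells contributes a 1 bit.
    """
    height, width = len(frame), len(frame[0])
    cells = []
    for by in range(hash_size):
        y0 = by * height // hash_size
        y1 = max((by + 1) * height // hash_size, y0 + 1)
        for bx in range(hash_size):
            x0 = bx * width // hash_size
            x1 = max((bx + 1) * width // hash_size, x0 + 1)
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def best_match(query, database, max_mean_distance=10.0):
    """Slide the query hash sequence over every reference sequence.

    query: list of frame hashes for the (possibly fragmented) query video.
    database: dict mapping a video name to its list of frame hashes.
    Returns (name, frame_offset, mean_distance) for the closest alignment,
    or None if nothing is within max_mean_distance (an assumed threshold).
    """
    best = None
    for name, reference in database.items():
        for offset in range(len(reference) - len(query) + 1):
            window = reference[offset:offset + len(query)]
            score = sum(hamming_distance(q, r) for q, r in zip(query, window))
            score /= len(query)
            if best is None or score < best[2]:
                best = (name, offset, score)
    if best is not None and best[2] <= max_mean_distance:
        return best
    return None
```

Because the hash compares each cell to the frame's own mean, a uniform brightness shift that does not clip leaves the hash unchanged, which illustrates why such hashes can tolerate some transformations. The sliding window is one simple way to locate a fragment inside a longer reference; a system searching a large database would likely need an index rather than this exhaustive scan.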
