Efficient detection of large-scale redundancy in enterprise file systems

To catch and reduce waste in the face of exponentially increasing demand for disk storage, we have developed a highly efficient technique for detecting approximate duplication of large directory hierarchies. Such duplication can arise, for example, when uncoordinated employees or departments mirror the same repositories unnecessarily. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level, such as coordinating the owners and consolidating the multiple copies in one location.
