Applying syntactic similarity algorithms for enterprise information management

To implement content management solutions and enable new applications related to data retention, regulatory compliance, and litigation, enterprises need advanced analytics that uncover relationships among documents, e.g., content similarity, provenance, and clustering. In this paper, we evaluate the performance of four syntactic similarity algorithms. Three algorithms are based on Broder's "shingling" technique, while the fourth employs a more recent approach, "content-based chunking". For our experiments, we use a specially designed corpus of documents that includes a set of "similar" documents with a controlled number of modifications. Our performance study reveals that the similarity metric of all four algorithms is highly sensitive to the settings of two algorithm parameters: sliding window size and fingerprint sampling frequency. We identify a useful range of these parameters for achieving good practical results, and compare the performance of the four algorithms in a controlled environment. We validate our results by applying these algorithms to find near-duplicates in two large collections of HP technical support documents.
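
To make the two parameters concrete, the following is a minimal sketch of a shingling-style similarity measure, not the implementation evaluated in the paper. It assumes word-level shingles, SHA-1 based fingerprints, and "keep every fingerprint divisible by `sample_every`" as the sampling rule; the function names and the `window` and `sample_every` parameters are hypothetical and chosen only for illustration. Similarity is computed as the Jaccard ratio of the two sampled fingerprint sets, in the spirit of Broder's resemblance.

```python
import hashlib


def shingles(text, window=4):
    """Return the set of word-level shingles produced by a sliding window."""
    words = text.lower().split()
    if len(words) < window:
        # Assumption: a text shorter than the window forms a single shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + window])
            for i in range(len(words) - window + 1)}


def fingerprint(shingle):
    """Map a shingle to a 64-bit integer (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sampled_fingerprints(text, window=4, sample_every=1):
    """Keep only fingerprints divisible by sample_every (1 keeps all)."""
    return {fp for fp in map(fingerprint, shingles(text, window))
            if fp % sample_every == 0}


def resemblance(a, b, window=4, sample_every=1):
    """Jaccard similarity of the two sampled fingerprint sets."""
    fa = sampled_fingerprints(a, window, sample_every)
    fb = sampled_fingerprints(b, window, sample_every)
    if not fa and not fb:
        # Nothing survived sampling, so no evidence of similarity.
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox leaps over the lazy dog"
    print(resemblance(a, b, window=2))  # 0.6
    print(resemblance(a, b, window=4))  # 0.2
```

In this toy example a single changed word lowers the score from 0.6 to 0.2 when the window grows from two words to four, which illustrates the kind of sensitivity to window size that the study reports. Raising `sample_every` above 1 shrinks the fingerprint sets and makes the score depend on which fingerprints happen to survive sampling.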
