Fuzzy-Fingerprints for Text-Based Information Retrieval

This paper introduces a particular form of fuzzy-fingerprints: their construction, their interpretation, and their use in the field of information retrieval. Though the concept of fingerprinting in general is not new, the way of using fingerprints within a similarity search as described here is: instead of computing the similarity between two fingerprints in order to assess the similarity between the associated objects, simply the event of a fingerprint collision is used for a similarity assessment. The main impact of this approach is the small number of comparisons necessary to conduct a similarity search.
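
The collision idea can be illustrated with a minimal Python sketch. The function `toy_fingerprint`, its letter groups, and its quantization thresholds are hypothetical stand-ins chosen for illustration only; they are not the fuzzy-fingerprint construction of the paper. The point shown is the retrieval step: documents are bucketed by fingerprint once, and a query is answered with a single hash lookup rather than a pairwise similarity computation against every document.

```python
from collections import defaultdict

# Hypothetical toy fingerprint, NOT the construction from the paper:
# quantize the share of words starting in each of four letter groups.
GROUPS = ("abcdef", "ghijklm", "nopqrs", "tuvwxyz")


def toy_fingerprint(text):
    """Map a text to a coarse code so that similar texts tend to collide."""
    words = [w for w in text.lower().split() if w[0].isalpha()]
    if not words:
        return (0,) * len(GROUPS)
    code = []
    for group in GROUPS:
        share = sum(w[0] in group for w in words) / len(words)
        # Coarse quantization (assumed thresholds): 0 rare, 1 medium, 2 frequent.
        code.append(0 if share < 0.2 else 1 if share < 0.4 else 2)
    return tuple(code)


def build_index(docs):
    """Bucket document ids by fingerprint (one hash-table entry per code)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        index[toy_fingerprint(text)].append(doc_id)
    return index


def similar(index, query):
    """Return the ids whose fingerprint collides with the query's fingerprint."""
    return index.get(toy_fingerprint(query), [])


docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox leaps over the lazy dog",
    "d3": "to be or not to be that is the question",
}
index = build_index(docs)
candidates = similar(index, "the quick brown fox jumped over the lazy dog")
```

In this sketch the query collides with the two near-duplicate documents and not with the third, and no fingerprint-to-fingerprint similarity is ever computed. How well collisions track actual document similarity depends entirely on the fingerprint construction, which this toy version does not attempt to reproduce.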
