Wikipedia in the pocket: indexing technology for near-duplicate detection and high similarity search
We develop and implement a new indexing technology that allows complete (and possibly very large) documents to be used as queries, with retrieval performance comparable to that of a standard term query. Our approach targets retrieval tasks such as near-duplicate detection and high similarity search. To demonstrate the performance of our technology we have compiled the search index "Wikipedia in the Pocket", which contains about 2 million English and German Wikipedia articles. This index, along with a search interface, fits on a conventional CD (0.7 gigabytes). The ingredients of our indexing technology are similarity hashing and minimal perfect hashing.
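The abstract does not spell out how a whole document can serve as a query, so the following Python sketch illustrates the general idea under stated assumptions. It uses a generic SimHash-style fingerprint as a stand-in for the similarity hashing named above (it is not the authors' method), and an ordinary Python dictionary in place of a minimal perfect hash function. The names `simhash`, `hamming`, and `FingerprintIndex`, the 64-bit fingerprint width, and the four-block lookup scheme are all hypothetical choices made for illustration.

```python
import hashlib
import re
from collections import defaultdict


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """SimHash-style fingerprint (assumed stand-in for similarity hashing).

    Each token votes +1 or -1 on every bit position; the sign of the
    total decides the bit, so similar texts tend to differ in few bits.
    """
    weights = [0] * bits
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64 pseudo-random bits
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, w in enumerate(weights):
        if w > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintIndex:
    """Toy index: a dict stands in for a minimal perfect hash table."""

    def __init__(self, bits=64, blocks=4):
        self.bits = bits
        self.blocks = blocks
        self.width = bits // blocks
        self.table = defaultdict(set)   # (block no., block value) -> doc ids
        self.fingerprints = {}          # doc id -> full fingerprint

    def _keys(self, fp):
        mask = (1 << self.width) - 1
        return [(b, (fp >> (b * self.width)) & mask)
                for b in range(self.blocks)]

    def add(self, doc_id, text):
        fp = simhash(text, self.bits)
        self.fingerprints[doc_id] = fp
        for key in self._keys(fp):
            self.table[key].add(doc_id)

    def query(self, text, max_distance=3):
        """Return ids of documents whose fingerprint is within max_distance."""
        fp = simhash(text, self.bits)
        candidates = set()
        for key in self._keys(fp):
            candidates |= self.table.get(key, set())
        return sorted(d for d in candidates
                      if hamming(fp, self.fingerprints[d]) <= max_distance)
```

In this sketch a query costs one fingerprint computation plus four table lookups regardless of document length, which is the sense in which a whole-document query can be about as cheap as a term query. With four blocks, any two fingerprints that differ in at most three bits must agree on at least one block, so the block lookup does not miss candidates within the default `max_distance`. For example, after `idx = FingerprintIndex()` and `idx.add("a", some_text)`, calling `idx.query(some_text)` returns a list containing `"a"`.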