Efficient Fuzzy Search in Large Text Collections

We consider the problem of fuzzy full-text search in large text collections, that is, full-text search that is robust against errors in both the query and the documents. Standard inverted-index techniques work extremely well for ordinary full-text search but fail to achieve interactive query times (below 100 milliseconds) for fuzzy full-text search, even on moderately sized text collections (above 10 GB of text). We present new pre-processing techniques that achieve interactive query times on large text collections (100 GB of text, served by a single machine). We consider two similarity measures: one where the query terms match similar terms in the collection (e.g., algorithm matches algoritm or vice versa), and one where the query terms match terms with a similar prefix in the collection (e.g., alori matches algorithm). The latter is important when we want to display results instantly after each keystroke (search as you type). All algorithms have been fully integrated into the CompleteSearch engine.
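
To make the two similarity measures concrete, the following minimal sketch computes both with the textbook dynamic program for Levenshtein (edit) distance. It only illustrates the measures themselves, not the pre-processing techniques presented here; the function names are hypothetical, and reading "similar" as "within a small edit distance" is an assumption of this example.

```python
def edit_distance_row(query, word):
    """Return the last row of the Levenshtein DP table.

    Entry j of the result is the edit distance between `query`
    and the first j characters of `word`.
    """
    row = list(range(len(word) + 1))  # empty query vs. word[:j] costs j insertions
    for i, qc in enumerate(query, start=1):
        prev_diag, row[0] = row[0], i  # query[:i] vs. empty word costs i deletions
        for j, wc in enumerate(word, start=1):
            cost = 0 if qc == wc else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # drop qc from the query
                row[j - 1] + 1,    # insert wc into the query
                prev_diag + cost,  # match or substitute
            )
    return row


def edit_distance(query, word):
    """Whole-word measure: distance between the two full strings."""
    return edit_distance_row(query, word)[-1]


def prefix_edit_distance(query, word):
    """Prefix measure: smallest distance between `query` and any prefix of `word`."""
    return min(edit_distance_row(query, word))


print(edit_distance("algoritm", "algorithm"))      # 1
print(edit_distance("alori", "algorithm"))         # 4
print(prefix_edit_distance("alori", "algorithm"))  # 1
```

With a threshold of one error, algoritm matches algorithm under the first measure, while alori matches algorithm only under the prefix measure, because it is one edit away from the prefix algori.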
