Efficient Error-tolerant Query Autocompletion

Query autocompletion is an important feature that saves users many keystrokes by sparing them from typing the entire query. In this paper, we study query autocompletion that tolerates errors in users' input by using edit distance constraints. Previous approaches index data strings in a trie and continuously maintain all prefixes of data strings whose edit distance from the query is within the threshold. The major inherent problem is that the number of such prefixes is huge for the first few characters of the query and is exponential in the alphabet size. This results in slow query response even if the entire query approximately matches only a few prefixes. We propose a novel neighborhood generation-based algorithm, IncNGTrie, which can achieve up to two orders of magnitude speedup over existing methods for the error-tolerant query autocompletion problem. Our algorithm maintains only a small set of active nodes, thus saving both the space and the time needed to process the query. We also study efficient duplicate removal, which is a core problem in fetching query answers. In addition, we propose optimization techniques to reduce our index size and discuss several extensions to our method. Extensive experiments on real datasets demonstrate the efficiency of our method against existing methods.
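To make the baseline setting concrete, the following is a minimal Python sketch of the trie-based approach attributed above to previous work: it finds every trie prefix whose edit distance from the query is within a threshold and returns the data strings below those prefixes. It is not IncNGTrie, and it recomputes the active set from scratch rather than maintaining it incrementally. The names `Node`, `build_trie`, `active_prefixes`, `complete`, and `tau` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


class Node:
    """A trie node (hypothetical helper, not from the paper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.is_end = False  # True if a data string ends here


def build_trie(strings: List[str]) -> Node:
    root = Node()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, Node())
        node.is_end = True
    return root


def next_row(prev: List[int], ch: str, query: str) -> List[int]:
    """Extend the edit-distance DP by one trie character.

    prev[j] is the edit distance between the current prefix and query[:j].
    """
    row = [prev[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == ch else 1
        row.append(min(prev[j] + 1, row[j - 1] + 1, prev[j - 1] + cost))
    return row


def active_prefixes(root: Node, query: str, tau: int) -> Dict[str, int]:
    """Return {prefix: distance} for trie prefixes within tau of the query."""
    active: Dict[str, int] = {}
    stack = [(root, "", list(range(len(query) + 1)))]
    while stack:
        node, prefix, row = stack.pop()
        if row[-1] <= tau:
            active[prefix] = row[-1]
        if min(row) > tau:
            continue  # row minima never decrease, so no descendant can qualify
        for ch, child in node.children.items():
            stack.append((child, prefix + ch, next_row(row, ch, query)))
    return active


def complete(root: Node, query: str, tau: int) -> List[str]:
    """Collect data strings under any active prefix; the set removes duplicates."""
    results = set()
    for prefix in active_prefixes(root, query, tau):
        node = root
        for ch in prefix:
            node = node.children[ch]
        stack = [(node, prefix)]
        while stack:
            n, s = stack.pop()
            if n.is_end:
                results.add(s)
            for ch, child in n.children.items():
                stack.append((child, s + ch))
    return sorted(results)


root = build_trie(["apple", "apply", "ape", "banana"])
print(complete(root, "apl", 1))  # ['ape', 'apple', 'apply']
```

In this toy example, the query "apl" with threshold 1 has four active prefixes ("ap", "app", "appl", "ape"), and the same answer strings are reachable from several of them, which is why the results are gathered into a set. On a realistic dictionary the active set for a one- or two-character query is far larger, which is the cost the abstract identifies.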
