Space-efficient data structures for Top-k completion

Virtually every modern search application, whether desktop, web, or mobile, features some kind of query auto-completion. In its basic form, the problem consists of retrieving from a string set a small number of completions, i.e., strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee reasonable coverage, and for mobile devices, where the amount of memory is limited. We present three different trie-based data structures to address this problem, each with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to sizes competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
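
To make the problem statement concrete, the following is a minimal sketch of top-k completion over a plain, uncompressed trie. It is not one of the three compressed data structures presented in the paper; it only illustrates the query being answered. The names `Node`, `build`, and `top_k` are hypothetical, and the sketch assumes distinct strings with static numeric scores. Each node stores the maximum score in its subtree, and a best-first search over a heap emits completions in decreasing score order.

```python
import heapq
from itertools import count


class Node:
    """Trie node for the illustrative (uncompressed) sketch."""
    __slots__ = ("children", "score", "best")

    def __init__(self):
        self.children = {}          # char -> Node
        self.score = None           # score of the string ending here, if any
        self.best = float("-inf")   # maximum score anywhere in this subtree


def build(scored_strings):
    """Build a trie from (string, score) pairs; strings are assumed distinct."""
    root = Node()
    for s, score in scored_strings:
        node = root
        node.best = max(node.best, score)
        for ch in s:
            node = node.children.setdefault(ch, Node())
            node.best = max(node.best, score)
        node.score = score
    return root


def top_k(root, prefix, k):
    """Return up to k (string, score) completions of prefix, highest score first."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []               # no string starts with this prefix

    tie = count()                   # tiebreaker so the heap never compares nodes
    # Heap entries: (-priority, tiebreak, string, node), where node is None
    # for a finished completion and a Node for an unexplored subtree.
    heap = [(-node.best, next(tie), prefix, node)]
    results = []
    while heap and len(results) < k:
        neg, _, s, n = heapq.heappop(heap)
        if n is None:
            # Every remaining entry has an upper bound no larger than this score.
            results.append((s, -neg))
            continue
        if n.score is not None:
            heapq.heappush(heap, (-n.score, next(tie), s, None))
        for ch, child in n.children.items():
            heapq.heappush(heap, (-child.best, next(tie), s + ch, child))
    return results


data = [("tree", 5), ("trie", 9), ("trip", 7), ("top", 3), ("triple", 8)]
root = build(data)
print(top_k(root, "tri", 2))   # [('trie', 9), ('triple', 8)]
```

This sketch stores explicit child pointers and per-node scores, which is exactly the kind of memory overhead that motivates a compressed representation when the string set is very large.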
