Ginix: Generalized inverted index for keyword search

Keyword search has become a ubiquitous way for users to access text data in the face of the information explosion. Inverted lists are commonly used to index the underlying documents so that the documents matching a set of keywords can be retrieved efficiently. Since inverted lists are usually large, many compression techniques have been proposed to reduce storage space and disk I/O time. However, these techniques usually decompress the lists on the fly, which increases CPU time. This paper presents a more efficient index structure, the Generalized INverted IndeX (Ginix), which merges consecutive IDs in inverted lists into intervals to save storage space. With this index structure, more efficient algorithms can be devised for the basic keyword search operations, i.e., union and intersection, by taking advantage of the intervals. Specifically, these algorithms do not require converting interval lists back to ID lists. As a result, keyword search using Ginix can be more efficient than keyword search using traditional inverted indexes. The performance of Ginix is further improved by reordering the documents in the datasets using two scalable algorithms. Experiments on the performance and scalability of Ginix on real datasets show that, compared with traditional inverted indexes, Ginix not only requires less storage space but also improves keyword search performance.
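
The core idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`to_intervals`, `intersect_intervals`, `union_intervals`) and the inclusive `(lo, hi)` tuple representation are assumptions made for illustration, and the plain two-pointer merges shown here stand in for whatever algorithms Ginix actually uses. The sketch only shows that runs of consecutive IDs can be stored as intervals and that intersection and union can work on those intervals directly, without expanding them back into ID lists.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive [lo, hi]; representation is an assumption


def to_intervals(ids: List[int]) -> List[Interval]:
    """Merge runs of consecutive IDs in a sorted, duplicate-free ID list."""
    intervals: List[Interval] = []
    for doc_id in ids:
        if intervals and doc_id == intervals[-1][1] + 1:
            # Extend the current run by one ID.
            intervals[-1] = (intervals[-1][0], doc_id)
        else:
            # Start a new run.
            intervals.append((doc_id, doc_id))
    return intervals


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted interval lists without expanding them to IDs."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance the list whose current interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Union two sorted interval lists, merging overlapping or adjacent runs."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) or j < len(b):
        # Take the interval with the smaller lower bound next.
        if j == len(b) or (i < len(a) and a[i][0] <= b[j][0]):
            lo, hi = a[i]
            i += 1
        else:
            lo, hi = b[j]
            j += 1
        if out and lo <= out[-1][1] + 1:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out


list_a = to_intervals([1, 2, 3, 7, 8, 10])
list_b = to_intervals([2, 3, 4, 8, 9, 10])
print(list_a)                               # [(1, 3), (7, 8), (10, 10)]
print(intersect_intervals(list_a, list_b))  # [(2, 3), (8, 8), (10, 10)]
print(union_intervals(list_a, list_b))      # [(1, 4), (7, 10)]
```

In this toy example a six-ID list shrinks to three intervals, and both operations run in time proportional to the number of intervals rather than the number of IDs. The benefit grows with longer runs of consecutive IDs, which is consistent with the abstract's point that reordering documents improves Ginix's performance.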
