Yet Another Sorting-Based Solution to the Reassignment of Document Identifiers

Inverted files are widely used in search engines, such as Web search and library search. Previous work has demonstrated that the compressed size of an inverted file can be significantly reduced through the reassignment of document identifiers. There are two main state-of-the-art solutions: the URL sorting-based solution, which sorts the documents by the alphabetical order of their URLs; and the TSP-based solution, which treats the reassignment as a Traveling Salesman Problem. These techniques achieve good compression but are significantly limited by their reliance on URLs and by the data size they can handle. In this paper, we propose an efficient solution to the reassignment problem that first sorts the terms in each document by document frequency and then sorts the documents by the presence of the terms. Our approach places few restrictions on data sets and is applicable to a variety of situations. Experimental results on four public data sets show that, compared with the TSP-based approach, our approach reduces the time complexity from \(O(n^2)\) to \(O(\overline{|D|} \cdot n\log n)\), where \(\overline{|D|}\) is the average length of the n documents, while achieving a comparable compression ratio; and compared with the URL sorting-based approach, our approach improves the compression ratio by up to 10.6% with approximately the same run-time.
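
The two sorting steps described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function name `reassign_doc_ids`, the choice to rank more frequent terms first, the alphabetical tie-breaking, and the lexicographic comparison of rank lists are all assumptions made for this example.

```python
from collections import Counter


def reassign_doc_ids(docs):
    """Return a list `order` where order[new_id] == old_id.

    `docs` is a list of term collections, one per document.
    Hypothetical helper written for illustration only.
    """
    # Document frequency: number of documents containing each term.
    df = Counter()
    for terms in docs:
        df.update(set(terms))

    # Assumption: more frequent terms get smaller ranks; ties are
    # broken alphabetically so the result is deterministic.
    ranked_terms = sorted(df, key=lambda t: (-df[t], t))
    rank = {term: r for r, term in enumerate(ranked_terms)}

    # Step 1: sort the terms inside each document by their rank.
    keys = [sorted(rank[t] for t in set(terms)) for terms in docs]

    # Step 2: sort the documents by comparing their rank lists
    # lexicographically, so documents sharing frequent terms become
    # neighbours. Python's sort is stable, so identical documents
    # keep their original relative order.
    return sorted(range(len(docs)), key=lambda i: keys[i])


if __name__ == "__main__":
    docs = [{"a", "b"}, {"c"}, {"a", "b", "c"}, {"a"}]
    print(reassign_doc_ids(docs))
```

In this sketch each document contributes a key whose length is its number of distinct terms, and each comparison during the document sort inspects at most that many ranks, which is in line with the \(O(\overline{|D|} \cdot n\log n)\) bound stated in the abstract.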
