Paper Information - Yet Another Sorting-Based Solution to the Reassignment of Document Identifiers

Inverted files are widely used in search engines, such as Web search and library search. Previous work has demonstrated that the compressed size of an inverted file can be significantly reduced by reassigning document identifiers. There are two main state-of-the-art solutions: the URL sorting-based solution, which sorts documents in alphabetical order of their URLs; and the TSP-based solution, which treats the reassignment as a Traveling Salesman Problem. These techniques achieve good compression but have significant limitations with respect to URLs and data size. In this paper, we propose an efficient solution to the reassignment problem that first sorts the terms in each document by document frequency and then sorts the documents by the presence of the terms. Our approach places few restrictions on data sets and is applicable to a variety of situations. Experimental results on four public data sets show that, compared with the TSP-based approach, our approach reduces the time complexity from \(O(n^2)\) to \(O(\overline{|D|} \cdot n\log n)\), where \(\overline{|D|}\) is the average length of the \(n\) documents, while achieving a comparable compression ratio; and compared with the URL sorting-based approach, our approach improves the compression ratio by up to 10.6% with approximately the same run time.
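
The two-step idea in the abstract (sort each document's terms by document frequency, then sort documents by which terms they contain) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not state the sort direction or the exact comparison rule, so the sketch assumes that terms are ranked by descending document frequency and that documents are compared lexicographically on their sorted rank lists. The function names `reassign_doc_ids` and `gap_bits`, and the bit-length cost used as a rough stand-in for compressed size, are hypothetical.

```python
from collections import Counter


def reassign_doc_ids(docs):
    """Return `order`, where order[new_id] = old_id.

    `docs` is a list of term lists. ASSUMPTION: terms are ranked by
    descending document frequency and documents are compared
    lexicographically on their sorted rank lists.
    """
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Rank 0 goes to the most frequent term; ties are broken by the term itself.
    rank = {t: r for r, t in enumerate(sorted(df, key=lambda t: (-df[t], t)))}
    # Step 1: represent each document by its distinct terms, sorted by rank.
    keys = [sorted(rank[t] for t in set(doc)) for doc in docs]
    # Step 2: sort the documents by those keys (Python's sort is stable).
    return sorted(range(len(docs)), key=lambda i: keys[i])


def gap_bits(docs, order):
    """Rough size estimate: total bit length of all d-gaps (hypothetical metric)."""
    new_id = {old: new for new, old in enumerate(order)}
    postings = {}
    for old, doc in enumerate(docs):
        for t in set(doc):
            postings.setdefault(t, []).append(new_id[old])
    total = 0
    for ids in postings.values():
        ids.sort()
        # The first entry is stored as id + 1 so that every coded value is >= 1.
        gaps = [ids[0] + 1] + [b - a for a, b in zip(ids, ids[1:])]
        total += sum(g.bit_length() for g in gaps)
    return total


docs = [["a", "b"], ["c", "d"], ["a", "b", "e"], ["c", "d"]]
order = reassign_doc_ids(docs)
print(order, gap_bits(docs, order), gap_bits(docs, list(range(len(docs)))))
```

Under these assumptions, each comparison between two documents touches at most their term lists, so sorting \(n\) documents costs on the order of \(\overline{|D|} \cdot n\log n\), which agrees with the complexity quoted in the abstract. On the toy collection, documents that share terms end up adjacent, which shortens the gaps in their posting lists.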

Liang Shi | Bin Wang
