Scalable techniques for document identifier assignment in inverted indexes

Web search engines depend on the full-text inverted index data structure. Because query processing performance depends heavily on the size of the inverted index, a large body of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to a significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms and perform a detailed evaluation on three large data sets, showing improvements in index size.
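
To make the framework concrete, the following is a minimal sketch, not the algorithms evaluated in the paper. It assumes each document is a non-empty set of integer term IDs, uses min-hash signatures with banding as the Locality Sensitive Hashing step, and substitutes a simple greedy nearest-neighbor walk for the Traveling Salesman computation. All function names and parameters (`lsh_candidate_edges`, `greedy_tour`, `num_hashes`, `band_size`) are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # 2^31 - 1; term IDs are assumed to be smaller than this


def minhash_signature(terms, hash_params):
    """Min-hash signature of a non-empty set of integer term IDs."""
    return tuple(min((a * t + b) % PRIME for t in terms) for a, b in hash_params)


def lsh_candidate_edges(docs, num_hashes=24, band_size=2, seed=0):
    """Return candidate pairs (i, j), i < j, that collide in at least one LSH band.

    Only these pairs become edges, so the graph stays sparse instead of
    containing all n*(n-1)/2 document pairs.
    """
    rng = random.Random(seed)
    hash_params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                   for _ in range(num_hashes)]
    buckets = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        sig = minhash_signature(terms, hash_params)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].append(doc_id)
    edges = set()
    for members in buckets.values():
        edges.update(combinations(members, 2))
    return edges


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def greedy_tour(docs, edges):
    """Order documents by repeatedly moving to the most similar unvisited neighbor.

    This greedy walk is an assumed stand-in for a TSP heuristic on the sparse graph.
    """
    neighbors = defaultdict(list)
    for i, j in edges:
        w = jaccard(docs[i], docs[j])
        neighbors[i].append((w, j))
        neighbors[j].append((w, i))
    visited, order = set(), []
    for start in range(len(docs)):
        current = start
        while current is not None and current not in visited:
            visited.add(current)
            order.append(current)
            unvisited = [(w, j) for w, j in neighbors[current] if j not in visited]
            current = max(unvisited)[1] if unvisited else None
    return order


if __name__ == "__main__":
    # Toy collection: two groups of similar documents, interleaved.
    toy_docs = [{1, 2, 3, 4}, {10, 11, 12, 13}, {1, 2, 3, 4}, {10, 11, 12, 14}]
    tour = greedy_tour(toy_docs, lsh_candidate_edges(toy_docs))
    # The position of a document in the tour becomes its new identifier.
    new_ids = {old_id: new_id for new_id, old_id in enumerate(tour)}
    print(tour, new_ids)
```

Placing similar documents next to each other in the tour tends to shrink the gaps between consecutive identifiers in each inverted list, which is what gap-based index compression schemes exploit.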
