Efficient Indexing of Versioned Document Sequences

Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g., ClearCase, CVS), wikis, and backup and archiving solutions. It is often desirable to support free-text search over such repositories, i.e., to allow queries that may match any version of any document. We propose an indexing method that exploits the redundancy inherent in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% relative to standard indexing while supporting the same retrieval capabilities.
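
To make the idea of exploiting redundancy across versions concrete, the following is a minimal sketch and not the algorithm proposed here. It assumes a single document whose versions are given as token lists, and it substitutes pairwise alignment of consecutive versions (via Python's `difflib.SequenceMatcher`) for the multiple sequence alignment variant mentioned above. The function names `build_versioned_index` and `versions_containing` are hypothetical. A token that survives unchanged across versions is stored once, with a version range, instead of once per version.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_versioned_index(versions):
    """Index one document's versions (token lists, oldest first).

    Returns a mapping term -> list of [first_version, last_version] ranges.
    Illustrative assumption: consecutive versions are aligned pairwise.
    """
    index = defaultdict(list)
    prev_tokens = []
    prev_postings = []  # open posting for each token position of the previous version
    for v, tokens in enumerate(versions):
        postings = [None] * len(tokens)
        matcher = SequenceMatcher(a=prev_tokens, b=tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Aligned tokens: extend the existing posting to this version.
                for k in range(i2 - i1):
                    posting = prev_postings[i1 + k]
                    posting[1] = v
                    postings[j1 + k] = posting
            else:
                # Inserted or replaced tokens: open a new posting.
                # For "delete" the range j1..j2 is empty, so nothing is added.
                for j in range(j1, j2):
                    posting = [v, v]
                    index[tokens[j]].append(posting)
                    postings[j] = posting
        prev_tokens, prev_postings = tokens, postings
    return index


def versions_containing(index, term):
    """Return the sorted version numbers in which `term` occurs."""
    found = set()
    for first, last in index.get(term, []):
        found.update(range(first, last + 1))
    return sorted(found)


if __name__ == "__main__":
    versions = [
        "the quick brown fox".split(),
        "the quick red fox".split(),
        "the quick red fox jumps".split(),
    ]
    index = build_versioned_index(versions)
    compact = sum(len(ranges) for ranges in index.values())
    naive = sum(len(tokens) for tokens in versions)
    print("range postings:", compact, "per-version postings:", naive)
    print("versions containing 'red':", versions_containing(index, "red"))
```

In this sketch a query term is resolved to version ranges, so any version of the document remains retrievable, while unchanged text contributes a single posting regardless of how many versions carry it.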
