Faster Set Intersection Algorithms for Text Searching
University of Waterloo Technical Report CS-2007-13

The intersection of large ordered sets is a common problem in the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies by Demaine, López-Ortiz and Munro [ALENEX 2001] and by Baeza-Yates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider the random data set from Baeza-Yates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google, and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure performance both in terms of the number of comparisons and searches performed, and in terms of CPU time on two different architectures. Our results confirm or improve the results of both previous studies in their respective contexts (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular, we show that value-based search algorithms perform well on posting lists in terms of the number of comparisons performed.
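To make the problem setting concrete, the following is a minimal sketch of intersecting two sorted arrays by repeatedly searching the longer array for each element of the shorter one. It uses a generic doubling (galloping) search followed by a binary search; it is an illustration only and is not claimed to reproduce any of the algorithms evaluated in this report. The function names `gallop` and `intersect` are hypothetical, and the inputs are assumed to be strictly increasing lists of integers, as in a posting list of document identifiers.

```python
from bisect import bisect_left


def gallop(arr, target, lo):
    """Return the smallest index i >= lo with arr[i] >= target, or len(arr).

    Assumes arr is sorted and every element before index lo is < target.
    """
    n = len(arr)
    step = 1
    hi = lo
    # Double the step until arr[hi] >= target or we run past the end.
    while hi < n and arr[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # The answer now lies in [lo, min(hi, n)]; finish with a binary search.
    return bisect_left(arr, target, lo, min(hi, n))


def intersect(a, b):
    """Intersect two strictly increasing lists (illustrative sketch)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter list, search in the longer one
    result = []
    pos = 0
    for x in a:
        pos = gallop(b, x, pos)  # resume from the previous position
        if pos == len(b):
            break  # no remaining element of b can match
        if b[pos] == x:
            result.append(x)
    return result
```

Because each search resumes where the previous one stopped, the cost of a search depends on the distance to the next candidate rather than on the full length of the longer array, which is the kind of behaviour that matters when the two lists differ greatly in size.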