Efficient Common Items Extraction from Multiple Sorted Lists

Given a set of lists, each sorted in ascending order of item value, this paper addresses how to efficiently find the items common to all of the lists. This problem is sometimes known as common items extraction from sorted lists. A common approach scans all lists sequentially and in parallel until one of them is exhausted. However, we observe that when the overlap among the lists is low, this sequential-access approach can be improved significantly. We propose two algorithms, MergeSkip and MergeESkip, that skip as many list items as possible. Skipping saves a large number of item comparisons and thereby improves efficiency. We evaluate the proposed algorithms extensively on one real dataset and two synthetic datasets with different data distributions, and report our findings.
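
To make the contrast between sequential scanning and skipping concrete, the following is a minimal Python sketch. It is not the paper's MergeSkip or MergeESkip, whose details the abstract does not give. The function names `intersect_scan` and `intersect_skip` are hypothetical, binary search is only one assumed way to skip, and each list is assumed to hold distinct values in ascending order.

```python
from bisect import bisect_left


def intersect_scan(lists):
    """Baseline: move every cursor forward one item at a time."""
    if not lists or any(not lst for lst in lists):
        return []
    pos = [0] * len(lists)
    out = []
    # Stop as soon as any one list is exhausted.
    while all(p < len(lst) for p, lst in zip(pos, lists)):
        heads = [lst[p] for p, lst in zip(pos, lists)]
        target = max(heads)
        if all(h == target for h in heads):
            out.append(target)
            pos = [p + 1 for p in pos]
        else:
            # Advance each cursor whose head is below the largest head.
            pos = [p + 1 if h < target else p for p, h in zip(pos, heads)]
    return out


def intersect_skip(lists):
    """Skipping variant (illustrative assumption): jump each cursor
    directly to the first item >= the largest current head."""
    if not lists or any(not lst for lst in lists):
        return []
    pos = [0] * len(lists)
    out = []
    while True:
        target = max(lst[p] for p, lst in zip(pos, lists))
        matched = True
        for i, lst in enumerate(lists):
            # Binary search from the current cursor skips smaller items.
            pos[i] = bisect_left(lst, target, pos[i])
            if pos[i] == len(lst):
                return out  # one list is exhausted
            if lst[pos[i]] != target:
                matched = False
        if matched:
            out.append(target)
            for i, lst in enumerate(lists):
                pos[i] += 1
                if pos[i] == len(lst):
                    return out


lists = [[1, 3, 5, 7, 9, 11], [3, 4, 5, 9, 10], [0, 3, 9, 12]]
print(intersect_scan(lists))  # [3, 9]
print(intersect_skip(lists))  # [3, 9]
```

Both functions return the same result. The skipping variant examines fewer items when the lists share few values, because each binary search can pass over a long run of non-matching items in one step.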
