Powering Archive Store Query Processing via Join Indices

In recent years, the industry landscape surrounding data processing systems has been significantly impacted by Big Data. Core technology and algorithms for data analysis have been adjusted and redesigned to handle the ever-increasing amount of data. In this paper we revisit the concept of the join index, a base mechanism in relational DBMSs to support the expensive join operator, and analyze how it can be effectively integrated and combined with other mechanisms widely deployed for large-scale data processing. In particular, we show how the data store Informatica IDV, originally designed to facilitate backup and archival of application data, can benefit from join indices to give fast SQL-based access to archival data for discovery purposes. Informatica IDV supports both horizontal and vertical partitioning, two mechanisms that are widely used in modern data stores to speed up large-scale data processing. Partitioning, however, requires us to reexamine join index design and usage. We propose a scalable, partitioned, columnar join index that supports parallel execution, ease of maintenance, and a late materialization query processing approach, which is efficient for column-stores. Our implementation based on Informatica IDV has been evaluated using a TPC-H based benchmark, showing significant performance improvements compared to executions without a join index.

CCS Concepts: • Information systems → Join algorithms
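To make the idea of a partitioned, columnar join index with late materialization concrete, the following is a minimal sketch in Python. It is not the Informatica IDV implementation: the table layout (a dict of column lists), the function names `build_join_index`, `partition_index`, and `query`, and the sample `orders` and `customers` data are all assumptions made for illustration. The index is stored as two parallel row-id columns, split by horizontal partition of the left table, and output columns are fetched only after filtering on row ids.

```python
from collections import defaultdict

def build_join_index(left_keys, right_keys):
    """Precompute an equi-join as two parallel row-id columns (left_ids, right_ids)."""
    positions = defaultdict(list)
    for rid, key in enumerate(right_keys):
        positions[key].append(rid)
    left_ids, right_ids = [], []
    for lid, key in enumerate(left_keys):
        for rid in positions.get(key, ()):
            left_ids.append(lid)
            right_ids.append(rid)
    return left_ids, right_ids

def partition_index(left_ids, right_ids, partition_size):
    """Split the index by horizontal partition of the left table (hypothetical scheme)."""
    parts = defaultdict(lambda: ([], []))
    for lid, rid in zip(left_ids, right_ids):
        part = parts[lid // partition_size]
        part[0].append(lid)
        part[1].append(rid)
    return dict(parts)

def query(orders, customers, index_part, min_total):
    """Late materialization: filter on row ids first, fetch output columns last."""
    l_ids, r_ids = index_part
    keep = [(l, r) for l, r in zip(l_ids, r_ids) if orders["total"][l] >= min_total]
    return [(customers["name"][r], orders["total"][l]) for l, r in keep]

# Illustrative columnar tables (made-up data).
customers = {"custkey": [10, 20, 30], "name": ["Ann", "Bob", "Cy"]}
orders = {"custkey": [20, 10, 20, 30], "total": [5.0, 50.0, 70.0, 20.0]}

left_ids, right_ids = build_join_index(orders["custkey"], customers["custkey"])
parts = partition_index(left_ids, right_ids, partition_size=2)

# Each partition is independent, so this loop could be run in parallel.
results = []
for part_no in sorted(parts):
    results.extend(query(orders, customers, parts[part_no], min_total=20.0))
```

Because each index partition refers only to its own slice of the left table, the per-partition calls to `query` share no state; that independence is what would allow parallel execution in a real system.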