Paper information - Column-stores vs. row-stores: how different are they really?

There has been a significant amount of excitement and recent work on column-oriented database systems ("column-stores"). These database systems have been shown to perform more than an order of magnitude better than traditional row-oriented database systems ("row-stores") on analytical workloads such as those found in data warehouses, decision support, and business intelligence applications. The elevator pitch behind this performance difference is straightforward: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query. This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false. We compare the performance of a commercial row-store under a variety of different configurations with a column-store and show that the row-store performance is significantly slower on a recently proposed data warehouse benchmark. We then analyze the performance difference and show that there are some important differences between the two systems at the query executor level (in addition to the obvious differences at the storage layer level). Using the column-store, we then tease apart these differences, demonstrating the impact on performance of a variety of column-oriented query execution techniques, including vectorized query processing, compression, and a new join algorithm we introduce in this paper. We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.
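The "elevator pitch" in the abstract (a column-store reads only the attributes a query touches) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy four-column table, the function names, and the fixed 32-bit integer encoding are all hypothetical. The sketch covers only the storage-layer effect; it does not model the executor-level techniques (vectorized processing, compression, the new join algorithm) that the abstract says also matter.

```python
import struct

# Hypothetical toy table: (order_id, customer_id, quantity, revenue),
# each attribute stored as a little-endian 32-bit integer.
ROWS = [(i, i % 7, i % 5 + 1, (i % 5 + 1) * 10) for i in range(1000)]
NUM_COLS = 4
FIELD = struct.Struct("<i")
REVENUE_COL = 3


def row_layout(rows):
    """Serialize whole tuples back to back (row-store style)."""
    return b"".join(struct.pack("<4i", *r) for r in rows)


def column_layout(rows):
    """Serialize each attribute contiguously (column-store style)."""
    return [b"".join(FIELD.pack(r[c]) for r in rows) for c in range(NUM_COLS)]


def sum_revenue_row(buf):
    """Sum revenue from the row layout; every tuple is read in full."""
    total = 0
    for _order, _customer, _quantity, revenue in struct.iter_unpack("<4i", buf):
        total += revenue
    return total, len(buf)  # (answer, bytes scanned)


def sum_revenue_col(cols):
    """Sum revenue from the column layout; only one column is read."""
    col = cols[REVENUE_COL]
    total = sum(value for (value,) in FIELD.iter_unpack(col))
    return total, len(col)  # (answer, bytes scanned)


if __name__ == "__main__":
    print("row layout   :", sum_revenue_row(row_layout(ROWS)))
    print("column layout:", sum_revenue_col(column_layout(ROWS)))
```

Both functions compute the same aggregate, but the row layout scans all four attributes of every tuple while the column layout scans a single attribute. This is the simplistic view the paper starts from; its argument is that vertically partitioning or fully indexing a row-store to mimic this layout does not by itself recover column-store performance.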

[1] Philip A. Bernstein, et al. Using Semi-Joins to Solve Relational Queries, 1981, JACM.

[2] Daniel J. Abadi, et al. Query execution in column-oriented database systems, 2008.

[3] Daniel J. Abadi, et al. Column-oriented Database Systems, 2009, Proc. VLDB Endow.

[4] Michael Stonebraker, et al. C-Store: A Column-oriented DBMS, 2005, VLDB.

[5] Marcin Zukowski, et al. MonetDB/X100 - A DBMS In The CPU Cache, 2005, IEEE Data Eng. Bull.

[6] Marcin Zukowski, et al. Super-Scalar RAM-CPU Cache Compression, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[7] David J. DeWitt, et al. Weaving Relations for Cache Performance, 2001, VLDB.

[8] Daniel J. Abadi, et al. Performance tradeoffs in read-optimized databases, 2006, VLDB.

[9] Don S. Batory, et al. On searching transposed files, 1978, ACM Trans. Database Syst.

[10] David J. DeWitt, et al. Materialization Strategies in a Column-Oriented DBMS, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[11] Daniel J. Abadi, et al. Integrating compression and execution in column-oriented database systems, 2006, SIGMOD Conference.

[12] Marcin Zukowski, et al. MonetDB/X100: Hyper-Pipelining Query Execution, 2005, CIDR.

[13] Anastasia Ailamaki, et al. QPipe: a simultaneously pipelined relational query engine, 2005, SIGMOD '05.

[14] Ramesh C. Agarwal, et al. Block oriented processing of relational database operations in modern computer architectures, 2001, Proceedings 17th International Conference on Data Engineering.

[15] Kenneth A. Ross, et al. Buffering database operations for enhanced instruction cache performance, 2004, SIGMOD '04.

[16] Goetz Graefe, et al. Multi-table joins through bitmapped join indices, 1995, SIGMOD Rec.

[17] Goetz Graefe. Efficient columnar storage in B-trees, 2007, SIGMOD Rec.

[18] Goetz Graefe, et al. Volcano - An Extensible and Parallel Query Evaluation System, 1994, IEEE Trans. Knowl. Data Eng.

[19] Xuedong Chen, et al. Adjoined Dimension Column Clustering to Improve Data Warehouse Query Performance, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[20] Andreas Weininger. Efficient execution of joins in a star schema, 2002, SIGMOD '02.

[21] David J. DeWitt, et al. A Comparison of C-Store and Row-Store in a Common Framework, 2006.

[22] Martin L. Kersten, et al. MIL primitives for querying a fragmented world, 1999, The VLDB Journal.

[23] Patrick Valduriez, et al. A query processing strategy for the decomposed storage model, 1987, 1987 IEEE Third International Conference on Data Engineering.