SUPRA: a sampling-query optimization method for large-scale OLAP

Relational online analytical processing (ROLAP) reduces the storage required to maintain data cubes of various sizes by materializing only parts of them lazily. In ROLAP, however, cube creation queries must be issued repeatedly to search for useful features (i.e., rules or patterns) within large-scale databases, so the cost of cube creation can become a bottleneck for ROLAP processing as a whole. This cost can be reduced effectively by estimating query results from samples. To maintain the accuracy of ROLAP when samples are used, the samples must be extracted in an appropriate unit. However, conventional query optimization methods support only record-based sampling and cannot be applied to complex queries that have other sampling units, such as queries that include grouping aggregate operations. We develop a query optimization method named SUPRA, which is designed to preserve both the sampling unit used in random data extraction and the randomness of the sampling operation. Using this method, typical ROLAP queries can be transformed into more efficient ones than those obtained through conventional methods.
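To illustrate why the sampling unit matters for queries with grouping aggregates, the following is a minimal sketch on toy data. It is not the SUPRA method itself. The query shape (an average over per-group sums), the data, and all function names are assumptions chosen for illustration. The sketch contrasts record-based sampling, which thins out every group and distorts the per-group sums, with sampling whole groups, which leaves each retained group intact.

```python
import random
from collections import defaultdict

Record = tuple[int, float]  # (group key, amount); a hypothetical schema


def avg_of_group_totals(records: list[Record]) -> float:
    """AVG over the result of SUM(amount) GROUP BY key."""
    totals: dict[int, float] = defaultdict(float)
    for key, amount in records:
        totals[key] += amount
    if not totals:
        raise ValueError("empty input: no groups to average over")
    return sum(totals.values()) / len(totals)


def sample_records(records: list[Record], rate: float,
                   rng: random.Random) -> list[Record]:
    """Record-based sampling: each record is kept independently."""
    return [r for r in records if rng.random() < rate]


def sample_groups(records: list[Record], rate: float,
                  rng: random.Random) -> list[Record]:
    """Group-unit sampling: each group is kept or dropped as a whole."""
    keys = sorted({key for key, _ in records})
    kept = {key for key in keys if rng.random() < rate}
    return [r for r in records if r[0] in kept]


# Toy data (assumption): 1000 groups, each with 10 records of amount 1.0,
# so every per-group sum is exactly 10.0.
records = [(key, 1.0) for key in range(1000) for _ in range(10)]
rng = random.Random(0)

exact = avg_of_group_totals(records)
by_record = avg_of_group_totals(sample_records(records, 0.5, rng))
by_group = avg_of_group_totals(sample_groups(records, 0.5, rng))

print("exact:", exact)
print("record-based sample:", by_record)
print("group-unit sample:", by_group)
```

In this toy setting the group-unit estimate matches the exact answer because every retained group keeps all of its records, whereas the record-based estimate is far too low because each per-group sum is computed from only a fraction of the group's records. On real data the group-unit estimate would vary with the sample rather than match exactly.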