Paper Information - Optimizing Distributed Join for Array Database System

Optimizing Distributed Join for Array Database System

With the sustained and rapid development of science and technology, the explosion of scientific data awaiting analysis has created enormous pressure. To reduce this pressure, scientists use array databases instead of RDBMSs to store and manage scientific data. However, our experiments show that although the array database outperforms the RDBMS on simple queries, it does not support complex multi-table join queries well. Because network communication is the slowest component of multi-table join queries in distributed parallel databases, we introduce an optimized join algorithm that not only minimizes network communication by optimizing the transfer schedule but also reduces CPU utilization, preventing the CPU from becoming the bottleneck in computation-intensive workloads. Our evaluation, based on real scientific data and a real database, shows that the optimized algorithm adapts to diverse datasets and query types and enables the array database to outperform the RDBMS on multi-table queries in real workloads.

Ming Zhu | Hui Li | Jing Li | Mei Chen
