Optimizing Distributed Join for Array Database System

With the sustained and rapid development of science and technology, the explosive growth of scientific data has put heavy pressure on data analysis. To relieve this pressure, scientists use array databases instead of RDBMSs to store and manage scientific data. However, our experiments show that although the array database outperforms the RDBMS on simple queries, it does not support complex multi-table join queries well. Because network communication is the slowest component of multi-table join queries in distributed parallel databases, we introduce an optimized join algorithm that not only minimizes network communication by optimizing the transfer schedule but also reduces CPU utilization, preventing the CPU from becoming the bottleneck in computation-intensive workloads. Our evaluation on real scientific data and databases shows that the optimized algorithm adapts to diverse datasets and query types and enables the array database to outperform the RDBMS on multi-table queries in real workloads.