EPUR: An Efficient Parallel Update System over Large-Scale RDF Data

RDF is a standard model for data interchange on the web and is widely adopted for graph data management. With the explosive growth of RDF data, processing it incrementally while maximizing the parallelism of RDF systems has become a challenging problem. Existing research on RDF data management focuses mainly on parallel query processing and rarely addresses the optimization of data storage and update. Moreover, the conventional parallel models designed for query optimization are not suitable for data update. We therefore propose a new design for an efficient parallel update system that is novel in three respects. First, it introduces a new storage structure for RDF data and two kinds of indexes, which together facilitate parallel processing. Second, it provides a general parallel task execution framework to maximize the parallelism of the system. Third, it develops parallel update operations to handle incremental RDF data. Based on these innovations, we implement an efficient parallel update system (EPUR). Extensive experiments show that EPUR outperforms RDF-3X, Virtuoso, and PostgreSQL, and that it scales well with the number of threads.