A Flexible N-Triples Loader for Hadoop

The wide adoption of the RDF data model demands efficient and scalable query processing strategies. For this purpose, distributed programming paradigms such as Apache Spark on top of Hadoop are increasingly being used. Unfortunately, the Hadoop ecosystem lacks support for Semantic Web standards, e.g. reading an RDF serialization format, and thus bringing in RDF data still requires considerable effort. We therefore present PRoST-loader, an application which, given a set of N-Triples documents, creates logical partitions according to three widely adopted strategies: Triple Table (TT), Wide Property Table (WPT) with a single row for each subject, and Vertical Partitioning (VP). Each strategy has its own advantages and limitations depending on the data characteristics and the task to be carried out; the loader therefore leaves the choice of strategy to the data engineer. The tool combines the flexibility of Spark, the deserialization capabilities of Hive, and the compression power of Apache Parquet at the storage layer. We managed to process DBpedia (approx. 257M triples) in 3.5 min for TT, in approx. 3.1 days for VP, and in 16.8 min for WPT with up to 1,114 columns on a cluster with moderate resources. In this paper we present the partitioning strategies followed, but also aim to expose the community to this open-source tool, which facilitates the usage of Semantic Web data within the Hadoop ecosystem and makes it possible to carry out tasks such as the evaluation of SPARQL queries in a scalable manner.
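
To make the three layouts concrete, the following sketch shows how a Triple Table, Vertical Partitioning, and a Wide Property Table could be derived from an N-Triples file with Spark and written as Parquet. It is a minimal illustration under simplifying assumptions (naive whitespace tokenization of N-Triples, single-valued predicates for the WPT, hypothetical HDFS paths and hash-based directory names), not the actual PRoST-loader implementation.

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ntriples-partition-sketch")
      .getOrCreate()
    import spark.implicits._

    // Triple Table (TT): one row per triple, parsed naively from N-Triples.
    // A real loader must handle literals with spaces, datatypes, blank nodes, etc.
    val tt = spark.read.textFile("hdfs:///data/dataset.nt") // hypothetical path
      .filter(line => line.trim.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val parts = line.trim.stripSuffix(".").trim.split("\\s+", 3)
        (parts(0), parts(1), parts(2))
      }
      .toDF("s", "p", "o")
    tt.write.mode("overwrite").parquet("hdfs:///out/tripletable")

    // Vertical Partitioning (VP): one two-column (subject, object) table
    // per distinct predicate; hashCode is used here only to get a safe path.
    val predicates = tt.select("p").distinct().as[String].collect()
    predicates.foreach { pred =>
      tt.filter($"p" === pred).select("s", "o")
        .write.mode("overwrite")
        .parquet(s"hdfs:///out/vp/${pred.hashCode}")
    }

    // Wide Property Table (WPT): one row per subject, one column per predicate
    // (single-valued case only; multi-valued predicates would need list columns).
    val wpt = tt.groupBy("s").pivot("p").agg(F.first("o"))
    wpt.write.mode("overwrite").parquet("hdfs:///out/wpt")

    spark.stop()
  }
}
```

The sketch also hints at why loading times differ so widely: TT and WPT are produced by a single pass (plus one pivot), whereas VP launches one write job per predicate, which becomes costly on datasets with many distinct predicates.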