Optimizing RDF(S) queries on cloud platforms

Scalable processing of Semantic Web queries has become a critical need given the rapid growth in the availability of Semantic Web data. The MapReduce paradigm is emerging as a platform of choice for large-scale data processing and analytics due to its ease of use, cost-effectiveness, and potential for unlimited scaling. Processing queries on Semantic Web triple models is a challenge on Apache Hadoop, the mainstream MapReduce platform, and on its extensions such as Pig and Hive. This is because such queries require numerous joins, which lead to lengthy and expensive MapReduce workflows. Further, in this paradigm, cloud resources are acquired on demand, and traditional join optimization machinery such as statistics and indexes is often absent or not easily supported. In this demonstration, we present RAPID+, an extended Apache Pig system that uses an algebraic approach for optimizing queries on RDF data models, including queries that involve inferencing. The basic idea is that by using logical and physical operators that are more natural to MapReduce processing, we can reinterpret such queries in a way that leads to more concise execution workflows and smaller intermediate data footprints, which in turn minimize disk I/O and network transfer overhead. RAPID+ evaluates queries using the Nested TripleGroup Data Model and Algebra (NTGA). The demo will compare the performance of NTGA query plans with that of the relational algebra-like query plans used by Apache Pig and Hive.
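
As a rough illustration of why a grouping-based interpretation can shorten a workflow, the sketch below matches a three-property star pattern by grouping triples on their subject once, where a relational-style plan would need two self-joins on the triple relation. It is a single-machine toy under stated assumptions, not RAPID+ code and not the NTGA operators themselves: the sample triples and the function names `triplegroups` and `match_star` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, property, object); not data from the paper.
triples = [
    ("p1", "rdf:type", "Product"),
    ("p1", "label", "Widget"),
    ("p1", "price", "10"),
    ("p2", "rdf:type", "Product"),
    ("p2", "label", "Gadget"),
    ("o1", "product", "p1"),
]


def triplegroups(triples):
    """Group triples by subject in one pass (one shuffle in MapReduce terms)."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((p, o))
    return dict(groups)


def match_star(groups, required_props):
    """Keep subjects whose group contains every required property,
    and retain only the (property, object) pairs the pattern asks for."""
    required = set(required_props)
    return {
        s: [(p, o) for p, o in pairs if p in required]
        for s, pairs in groups.items()
        if required <= {p for p, _ in pairs}
    }


# Star pattern: ?s rdf:type ?t . ?s label ?l . ?s price ?pr
stars = match_star(triplegroups(triples), ["rdf:type", "label", "price"])
print(stars)
```

Only `p1` carries all three properties, so it is the only subject that survives the filter. The point of the sketch is the shape of the plan: one grouping step followed by a per-group filter, with the matching triples kept together as a nested group instead of being flattened into wide joined tuples.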