Empirical study on the usage of graph query languages in open source Java projects

Graph data models are of interest in various domains, in part because of the intuitiveness and flexibility they offer compared to relational models. Specialized query languages, such as Cypher for property graphs or SPARQL for RDF, facilitate their use. In this paper, we present an empirical study on the usage of graph query languages in open-source Java projects on GitHub. We investigate the usage of SPARQL, Cypher, Gremlin and GraphQL in terms of their popularity and its development over time. We select repositories based on dependencies related to these technologies and apply popularity-based and source-code-based filters and ranking features for a targeted selection of projects. For SPARQL and Cypher, we analyze the activity of repositories over time. For SPARQL, we additionally investigate common application domains, query use, and the existence of ontological data modeling in applications that query for concrete instance data. Our results show that the usage of graph query languages in open-source projects has increased over recent years, with SPARQL and Cypher being by far the most popular. SPARQL projects are more active in terms of changes to query-related artifacts and the number of unique developers involved, but Cypher is catching up. Relatively few applications use SPARQL to query for concrete instance data; a majority of those that do employ multiple ontologies, including project-specific and domain-specific ones. Common application domains are management systems and data visualization tools.
