Empowering OCL research: a large-scale corpus of open-source data from GitHub

Model-driven engineering (MDE) raises the level of abstraction in software and system design. In particular, meta-models become a central artifact in the development process, and are supported by various other artifacts such as editors and transformations. To define constraints, invariants, and queries on model-driven artifacts, a generic language has been developed: the Object Constraint Language (OCL). In the literature, many studies of OCL have been performed on small collections of data, mostly originating from a single source (e.g., OMG standards). As a result, generalization of results beyond the data studied is often mentioned as a threat to validity. The creation of a benchmark dataset has already been identified as a key enabler for addressing this generalization threat. To facilitate further empirical studies in the field of OCL, we present the first large-scale dataset of 103,262 OCL expressions, systematically extracted from 671 GitHub repositories. The expressions were extracted from various types of files, among others metamodels and model-to-text transformations. In this work we showcase a variety of studies performed using our dataset, and describe several other types of studies that could be performed. We extend previous work with data and experiments regarding OCL in model-to-text (mtl) transformations.
