An Extensive Dataset of UML Models in GitHub

The Unified Modeling Language (UML) is widely taught in academia and well accepted in industry. However, no large dataset of UML diagrams is publicly available. Our aim is to offer a dataset of UML files, together with metadata on the software projects to which those files belong. To this end, we systematically mined over 12 million GitHub projects for UML files. We present a semi-automated approach for collecting UML stored in images, .xmi files, and .uml files. The resulting dataset contains over 93,000 UML diagrams from over 24,000 GitHub projects.
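As a rough illustration of the first, automatable step of such a collection, the sketch below sorts repository file paths into candidate model files and candidate images by extension. It is not the authors' pipeline: the function names and the list of image extensions are assumptions made for this example, and only .xmi and .uml come from the text above.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Optional

# .xmi and .uml are named in the abstract; the image extensions are an
# assumption for this sketch, since the abstract only says "images".
MODEL_EXTENSIONS = {".xmi", ".uml"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".bmp"}


def classify_candidate(path: str) -> Optional[str]:
    """Return "model", "image", or None based on the file extension only."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in MODEL_EXTENSIONS:
        return "model"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None


def candidate_uml_files(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths into candidate model files and candidate images."""
    grouped: Dict[str, List[str]] = {"model": [], "image": []}
    for path in paths:
        kind = classify_candidate(path)
        if kind is not None:
            grouped[kind].append(path)
    return grouped


if __name__ == "__main__":
    # Hypothetical file listing for one repository.
    listing = ["docs/Design.XMI", "model/shop.uml", "img/classes.png", "README.md"]
    print(candidate_uml_files(listing))
```

An extension match only yields candidates. An image with a matching extension may not be a UML diagram at all, so a later checking step, manual or classifier-based, would still be needed, which is in keeping with the approach being described as semi-automated.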
