An Automated Approach for Classifying Reverse-Engineered and Forward-Engineered UML Class Diagrams

UML class diagrams are commonly used to describe the design of software systems, and such designs can guide the construction of the software. In practice, we have identified two main kinds of UML class diagram: (i) forward-engineered class diagrams (FwCD), which are hand-made as part of the forward-looking development process; and (ii) reverse-engineered class diagrams (RECD), which are reverse engineered from the source code. Recently, empirical studies in software engineering have started looking at open source projects, which enables the automated extraction and analysis of large sets of project data. To study the effects of UML modeling in open source projects, we need a way to determine automatically how UML is used in such projects. For this, we propose an automated classifier that decides whether a diagram is an FwCD or an RECD. We present the construction of such a classifier by means of supervised machine learning algorithms. As part of its construction, we analyse which features are useful for distinguishing FwCD from RECD. By comparing different machine learning algorithms, we find that the Random Forest algorithm is the most suitable for our purpose. We evaluate the performance of the classifier on a test set of 999 class diagrams obtained from open source projects.
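
To make the evaluation step concrete, the following is a minimal sketch of how predictions of a two-class FwCD/RECD classifier could be scored against ground-truth labels. The abstract does not state which measures were used, so the choice of accuracy, precision, recall and F1 is an assumption here, as are the function name `evaluate` and the example labels, which are made up for illustration and are not results from the study.

```python
from collections import Counter


def evaluate(y_true, y_pred, positive="FwCD"):
    """Score two-class predictions; `positive` names the class treated as positive.

    Hypothetical helper for illustration only, not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")

    # Tally the four cells of the confusion matrix.
    counts = Counter()
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1

    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for six diagrams (illustration only).
y_true = ["FwCD", "FwCD", "RECD", "RECD", "FwCD", "RECD"]
y_pred = ["FwCD", "RECD", "RECD", "RECD", "FwCD", "FwCD"]
scores = evaluate(y_true, y_pred)
```

Treating FwCD as the positive class is arbitrary; passing `positive="RECD"` gives the per-class scores for the other label.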
