Automatic clustering constraints derivation from object-oriented software using weighted complex network with graph theory analysis

Abstract: Constrained clustering, also called semi-supervised clustering, has received considerable attention because it can incorporate minimal supervision from domain experts, or other side information, to improve the results of classic unsupervised clustering techniques. In software remodularisation, classic unsupervised software clustering techniques have proven useful for recovering a high-level abstraction of the design of poorly documented or poorly designed software systems. However, little work has applied constrained clustering to the same purpose of improving the modularity of software systems. Moreover, given time and budget constraints, it is laborious and unrealistic to expect domain experts with prior knowledge of the software to review every software artifact and provide supervision on demand. We aim to fill this research gap by proposing an automated approach that derives clustering constraints from the implicit structure of a software system using graph theory analysis. Evaluations conducted on 40 open-source object-oriented software systems show that the proposed approach can serve as an alternative means of deriving clustering constraints when domain experts are unavailable, thus helping to improve the overall accuracy of clustering results.
