Duplicated code pattern mining in visual programming languages

Visual Programming Languages (VPLs), coupled with the high-level abstractions that are commonplace in visual programming environments, enable users with less technical knowledge to become proficient programmers. However, because VPLs lower the skill floor, their programmers are also more likely to deviate from software development best practices, producing systems with high technical debt and thus poor maintainability. Duplicated code is one important example of such technical debt: we observed that duplication in OutSystems VPL code bases can reach as high as 39%. Duplicated code detection in text-based programming languages remains an active area of research, with important implications for software maintainability and evolution. However, to the best of our knowledge, the literature on duplicated code detection for VPLs is very limited. We propose a novel and scalable duplicated code pattern mining algorithm that leverages the visual structure of VPLs not only to detect duplicated code, but also to highlight the duplicated code patterns that explain the reported duplication. We evaluate the performance of the proposed approach on a wide range of real-world mobile and web applications developed using OutSystems.
