Exploiting tree structures for classifying programs by functionalities

Analyzing source code to solve software engineering problems such as fault prediction and cost and effort estimation has long attracted attention from both researchers and industry. Traditional approaches apply machine learning to software metrics, which are computed as standard measures of software projects. However, these methods face many challenges because software metrics are not sufficient to capture the complexity of programs. This paper applies several natural language processing techniques to software engineering problems by exploiting the information in programs' abstract syntax trees (ASTs) instead of software metrics. To reduce computation time, we propose a tree pruning technique that eliminates redundant branches of ASTs. In addition, the k-Nearest Neighbor (kNN) algorithm is adopted for comparison with the other methods, with the distance between programs measured by the tree edit distance (TED) and the Levenshtein distance. The algorithms are evaluated on a 104-label program classification problem. The experiments show that, owing to the use of appropriate data structures, the classifiers achieve promising results even though kNN is a simple machine learning algorithm.
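
As a rough illustration of the kNN idea described above, the following sketch classifies a program by the Levenshtein distance between flattened AST node sequences. It is not the paper's implementation: it uses Python's built-in `ast` module as a stand-in parser, the flattening of an AST into a sequence of node type names is an assumption about how the Levenshtein distance could be applied, and the function names (`node_sequence`, `levenshtein`, `knn_classify`) are hypothetical. The tree edit distance and the pruning step are not shown.

```python
import ast
from collections import Counter


def node_sequence(source):
    """Flatten a program's AST into a list of node type names.

    Assumption: ast.walk (breadth-first order) is used as a simple
    stand-in for whatever traversal the original work applies.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def knn_classify(query, labeled_programs, k=3):
    """Return the majority label among the k nearest labeled programs."""
    q = node_sequence(query)
    dists = sorted(
        (levenshtein(q, node_sequence(src)), label)
        for src, label in labeled_programs
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: (source code, functionality label).
train = [
    ("x = 1", "assign"),
    ("for i in range(3):\n    print(i)", "loop"),
]
predicted = knn_classify("y = 2", train, k=1)
```

Comparing node type names rather than raw text makes the distance insensitive to identifier names and literal values, which is one plausible reason a structural representation can help a simple classifier such as kNN.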
