Towards summarizing program statements in source code search

A common practice among programmers is to find pieces of source code using search engines. The programs these engines retrieve are typically similar in semantics but not necessarily in syntax, so ranking methods are used to present the most relevant programs to users. However, because implementations vary, users still need to understand the retrieved programs. In this paper, we propose a method that groups statements into clusters from a set of programs retrieved by a source code search engine. Each cluster comprises program statements that have similar, but not identical, semantics and that are pervasive. Our hypothesis is that such clusters help users understand a set of semantically related programs at a glance. We use approximate graph alignment to find correspondences among statements in two program dependence graphs that are similar with respect to their control and data flows, as well as the operations they perform. We then build a graph from the pairwise comparisons of program dependence graphs, and cast the problem of clustering statements as finding communities of statements that consistently align. Our evaluation on programs from BigCloneBench shows that the clusters of statements discovered by our approach help discern implementation variations.
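
The final clustering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the pairwise statement alignments have already been computed (the approximate graph alignment itself is not shown), and it substitutes plain connected components for the community detection the abstract mentions. The function name `cluster_statements`, the `min_score` and `min_programs` parameters, and the sample scores are all hypothetical.

```python
from collections import defaultdict


def cluster_statements(alignments, min_score=0.5, min_programs=2):
    """Group aligned statements into clusters (illustrative stand-in only).

    alignments: iterable of (left, right, score), where left and right are
    (program_id, statement) tuples and score is an assumed similarity in [0, 1].
    """
    # Build an undirected graph that keeps only sufficiently strong alignments.
    adjacency = defaultdict(set)
    for left, right, score in alignments:
        if score >= min_score:
            adjacency[left].add(right)
            adjacency[right].add(left)

    # Connected components stand in for "communities of statements".
    seen = set()
    clusters = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        # Keep only clusters that span enough distinct programs ("pervasive").
        programs = {program_id for program_id, _ in component}
        if len(programs) >= min_programs:
            clusters.append(sorted(component))
    return clusters


# Hypothetical alignments among statements of three retrieved programs.
alignments = [
    (("p1", "i = 0"), ("p2", "j = 0"), 0.9),
    (("p2", "j = 0"), ("p3", "k = 0"), 0.8),
    (("p1", "return x"), ("p3", "print(x)"), 0.2),  # too weak, dropped
]
clusters = cluster_statements(alignments)
```

Here the three initialization statements end up in a single cluster, while the weakly aligned pair is discarded. A real community detection algorithm would additionally split loosely connected components, which this sketch does not attempt.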
