Fine-Grained Code-Comment Semantic Interaction Analysis

Code comments, i.e., natural-language text that describes code, are considered a key enabler of program comprehension. Existing approaches mainly focus on comment generation or comment update, and thus fall short of explaining which part of the code gives rise to a specific piece of content in the comment. In this paper, we argue that addressing this challenge can better facilitate code understanding. We propose Fosterer, which builds fine-grained semantic interactions between code statements and comment tokens. It not only leverages advanced deep learning techniques such as cross-modal learning and contrastive learning, but also borrows from pre-trained vision models. Specifically, it mimics the comprehension practice of developers, treating code statements as image patches and comments as text, and uses contrastive learning to match the semantically related parts of the visual and textual information. Experiments on a large-scale manually labelled dataset show that our approach achieves an F1-score of around 80%, substantially exceeding a heuristic-based baseline. We also find that Fosterer is efficient: it needs only 1.5 seconds to infer the results for a code-comment pair. Furthermore, a user study demonstrates its usability: in 65% of cases, its predictions are considered useful for improving code understanding. Our research therefore sheds light on a promising direction for program comprehension.
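To make the idea of matching comment tokens to code statements concrete, the following is a minimal sketch, not Fosterer's actual implementation. It assumes that each code statement and each comment token has already been embedded into a shared vector space (here the vectors are hand-written toy values), and it links every token to its most similar statement by cosine similarity. The function names and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_tokens_to_statements(statement_vecs, token_vecs, threshold=0.5):
    """For each comment token, return the index of the most similar code
    statement, or None when no statement reaches the (assumed) threshold.
    Expects at least one statement vector."""
    links = []
    for token_vec in token_vecs:
        scores = [cosine(token_vec, s) for s in statement_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        links.append(best if scores[best] >= threshold else None)
    return links


# Toy embeddings (assumed): two code statements and three comment tokens.
statement_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.0], [0.0, 0.0, 1.0]]

links = match_tokens_to_statements(statement_vecs, token_vecs)
print(links)
```

In this toy setting the first token is linked to the first statement, the second token to the second statement, and the third token to none, since it is orthogonal to both statements. In a learned system the embeddings would come from trained encoders, and contrastive training would be what pulls related statement and token vectors together.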
