Paper Information - Fine-Grained Code-Comment Semantic Interaction Analysis

Code comments, i.e., natural language text that describes code, are considered a powerful aid to program comprehension. Existing approaches mainly focus on comment generation or comment update, and thus fall short of explaining which part of the code gives rise to a specific piece of content in the comment. In this paper, we argue that addressing this challenge can better facilitate code understanding. We propose Fosterer, which builds fine-grained semantic interactions between code statements and comment tokens. It not only leverages advanced deep learning techniques such as cross-modal learning and contrastive learning, but also borrows pre-trained vision models. Specifically, it mimics the comprehension practice of developers, treating code statements as image patches and comments as text, and uses contrastive learning to match the semantically related parts of the visual and textual information. Experiments on a large-scale manually labelled dataset show that our approach achieves an F1-score of around 80%, exceeding a heuristic-based baseline by a large margin. We also find that Fosterer is efficient: it needs only 1.5 seconds to infer the results for a code-comment pair. Furthermore, a user study demonstrates its usability: in 65% of cases, its predictions are considered useful for improving code understanding. Our research therefore sheds light on a promising direction for program comprehension.
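
To make the idea of statement-to-token matching concrete, the following is a minimal sketch and not the Fosterer implementation. It assumes that each code statement and each comment token has already been mapped to an embedding vector by some encoder (the vectors below are made-up toy values), and it links every comment token to its most similar statement by cosine similarity. The function names and the `threshold` parameter are hypothetical, introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_tokens_to_statements(stmt_vecs, token_vecs, threshold=0.5):
    """Map each comment-token index to the index of its most similar statement.

    A token whose best similarity falls below `threshold` (an assumed,
    illustrative cut-off) is mapped to None, meaning "no statement explains it".
    """
    links = {}
    for t, token_vec in enumerate(token_vecs):
        sims = [cosine(stmt_vec, token_vec) for stmt_vec in stmt_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        links[t] = best if sims[best] >= threshold else None
    return links


# Toy embeddings (hypothetical): two code statements, three comment tokens.
stmt_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

links = link_tokens_to_statements(stmt_vecs, token_vecs)
print(links)
```

In a contrastively trained model, the encoders would be optimized so that related statement-token pairs score high and unrelated pairs score low; the sketch covers only the matching step applied to embeddings that are assumed to exist.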
