Enriching Query Semantics for Code Search with Reinforcement Learning

Code search is a common practice for developers during software implementation. The main challenge of accurate code search lies in the knowledge gap between source code and natural language (i.e., queries). Because code-query pairs are scarce while code-description pairs are abundant, prior deep learning studies focus on learning the semantic matching relation between source code and its corresponding description text, and assume that the semantic gap between descriptions and user queries is marginal. In this work, we find that code search models trained on code-description pairs may not perform well on user queries, which indicates a semantic distance between queries and code descriptions. To reduce this distance and make code search more effective, we propose QueCos, a Query-enriched Code search model. QueCos uses reinforcement learning (RL) to generate semantically enriched queries that capture the key semantics of the given queries. With RL, code search performance serves as the reward for producing accurate semantically enriched queries. The enriched queries are then used for code search. Experiments on benchmark datasets show that QueCos significantly outperforms state-of-the-art code search models.
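As a rough illustration of how code search performance can act as a reward, the sketch below scores an enriched query by the reciprocal rank of the correct code snippet among a set of candidates. This is a minimal sketch under stated assumptions, not the QueCos implementation: the abstract does not say which retrieval metric is used as the reward, so reciprocal rank is an assumption, and the function names (`cosine`, `reciprocal_rank_reward`) and the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reciprocal_rank_reward(query_vec, code_vecs, target_index):
    """Reward = 1 / rank of the correct snippet when candidates are sorted by similarity.

    Assumption: reciprocal rank stands in for "code search performance";
    the source does not name the metric.
    """
    scores = [cosine(query_vec, c) for c in code_vecs]
    # Rank is 1 plus the number of candidates that score strictly higher than the target.
    rank = 1 + sum(1 for s in scores if s > scores[target_index])
    return 1.0 / rank


# Toy embeddings (hypothetical): three candidate snippets, the correct one is index 1.
code_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
original_query = [0.9, 0.2]   # raw user query, far from the correct snippet
enriched_query = [0.1, 0.9]   # enriched query, close to the correct snippet

reward_original = reciprocal_rank_reward(original_query, code_vecs, target_index=1)
reward_enriched = reciprocal_rank_reward(enriched_query, code_vecs, target_index=1)
```

In an RL setup of this kind, a query generator that produces `enriched_query` would receive the higher reward, which pushes it toward enrichments that help retrieval.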
