CodeQA: A Question Answering Dataset for Source Code Comprehension

We propose CodeQA, a free-form question answering dataset for source code comprehension: given a code snippet and a question, a textual answer must be generated. CodeQA comprises a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs. To obtain natural and faithful questions and answers, we apply syntactic rules and semantic analysis to transform code comments into question-answer pairs. We present the construction process, conduct a systematic analysis of the dataset, and report and discuss the results of several neural baselines. While research on question answering and machine reading comprehension is developing rapidly, little prior work has addressed question answering over source code. This new dataset can serve as a useful research benchmark for source code comprehension.
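To make the comment-to-QA transformation concrete, the minimal Python sketch below rewrites declarative one-line comments into question-answer pairs using a small table of syntactic rules. The rule patterns, question templates, and the `comment_to_qa` helper are hypothetical simplifications for exposition only; the paper's actual pipeline involves a richer rule set together with semantic analysis.

```python
import re

# Hypothetical rule table: comment pattern -> question template.
# Illustrative only; the real construction process uses many more
# syntactic rules plus semantic analysis of the comment.
RULES = [
    # "Returns X."  ->  Q: "What does the method return?"  A: "X"
    (re.compile(r"^returns?\s+(.+)$", re.IGNORECASE),
     "What does the method return?"),
    # "Checks whether X."  ->  Q: "What does the method check?"  A: "whether X"
    (re.compile(r"^checks?\s+(whether\s+.+)$", re.IGNORECASE),
     "What does the method check?"),
    # "Creates X."  ->  Q: "What does the method create?"  A: "X"
    (re.compile(r"^creates?\s+(.+)$", re.IGNORECASE),
     "What does the method create?"),
]

def comment_to_qa(comment: str):
    """Turn a one-line code comment into a (question, answer) pair,
    or return None when no rule matches."""
    comment = comment.strip().rstrip(".")
    for pattern, question in RULES:
        match = pattern.match(comment)
        if match:
            return question, match.group(1)
    return None

if __name__ == "__main__":
    print(comment_to_qa("Returns the index of the first match."))
    # -> ('What does the method return?', 'the index of the first match')
```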
