CoTexT: Multi-task Learning with Code-Text Transformer

We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming-language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarization/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of the available PL corpora, including both “bimodal” and “unimodal” data: bimodal data pairs natural-language text with its corresponding code snippets, whereas unimodal data consists of code snippets alone. We first evaluate CoTexT with multi-task learning, performing Code Summarization on six programming languages and Code Refinement on both the small and medium datasets featured in the CodeXGLUE benchmark. We then conduct extensive experiments on further CodeXGLUE tasks, including Code Generation and Defect Detection. CoTexT consistently achieves state-of-the-art (SOTA) results on these tasks, demonstrating the versatility of our models.
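As a minimal sketch of the text-to-text, multi-task setup described above (assuming a T5-style CoTexT checkpoint hosted in the Hugging Face format; the checkpoint path and task prefix below are illustrative placeholders, not values taken from the paper):

```python
# Sketch: querying a T5-style encoder-decoder (e.g., a CoTexT checkpoint)
# for code summarization through a text-to-text interface.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder path; substitute the actual CoTexT checkpoint you use.
checkpoint = "path/to/cotext-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# In T5-style multi-task learning, every task shares the same
# sequence-to-sequence interface; a task prefix selects the task.
code = "def add(a, b):\n    return a + b"
inputs = tokenizer("summarize python: " + code, return_tensors="pt")

summary_ids = model.generate(**inputs, max_length=48, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```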
