Finding Reusable Machine Learning Components to Build Programming Language Processing Pipelines

Programming Language Processing (PLP) using machine learning has advanced rapidly in the past few years, and a growing number of researchers and developers are interested in exploring this promising field. However, it is challenging for newcomers to find the right components to construct their own machine learning pipelines, given the diversity of PLP tasks to be solved, the large number of datasets and models being released, and the complex compilers and tools involved. To improve the findability, accessibility, interoperability, and reusability (FAIRness) of machine learning components, we collect and analyze a set of representative papers in the domain of machine learning-based PLP. We then identify and characterize key concepts, including PLP tasks, model architectures, and supportive tools. Finally, we present example use cases that leverage these reusable components to construct machine learning pipelines for a set of PLP tasks.
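To illustrate what reusing an existing component in a PLP pipeline can look like in practice, below is a minimal sketch of one such pipeline stage: loading a publicly available pre-trained code model and applying it to a code summarization task. It assumes the Hugging Face transformers library and the "Salesforce/codet5-base-multi-sum" checkpoint name; the specific checkpoint is an assumption here, and any comparable encoder-decoder code model could be substituted. This is not the paper's own implementation, only an example of the kind of pipeline the reusable components enable.

```python
# Minimal sketch: reuse an off-the-shelf pre-trained code model as one
# pipeline component for a PLP task (code summarization).
# Assumes the Hugging Face `transformers` library and the
# "Salesforce/codet5-base-multi-sum" checkpoint (assumed name);
# swap in any comparable encoder-decoder code model.
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_NAME = "Salesforce/codet5-base-multi-sum"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

source_code = "def add(a, b):\n    return a + b"

# Tokenize the source snippet and generate a natural-language summary.
inputs = tokenizer(source_code, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=32)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

The same skeleton extends to other PLP tasks by swapping the checkpoint and the task head, for example a classification head for vulnerability detection or a decoder configured for code-to-code translation.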
