AlgoLabel: A Large Dataset for Multi-Label Classification of Algorithmic Challenges

While semantic parsing has been an important problem in natural language processing for decades, recent years have seen wide interest in the automatic generation of code from text. We propose an alternative problem to code generation: labelling the algorithmic solution for programming challenges. While this may seem like an easier task, we highlight that current deep learning techniques are still far from offering a reliable solution. The contributions of the paper are twofold. First, we propose a large multi-modal dataset of text and code pairs consisting of algorithmic challenges and their solutions, called AlgoLabel. Second, we show that vanilla deep learning solutions must be greatly improved to solve this task, and we propose a dual text-code neural model for detecting the algorithmic solution type for a programming challenge. While the proposed text-code model improves on the performance of using text or code alone, the improvement is rather small, highlighting that better methods are required to combine text and code features.
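The dual text-code setup described above can be sketched at a very high level: encode the problem statement and the solution code into feature vectors, fuse them, and apply an independent sigmoid per candidate algorithm label (the standard multi-label head). The sketch below is purely illustrative; the actual AlgoLabel model uses learned neural encoders, and all vectors, weights, and label semantics here are hypothetical toy values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(text_vec, code_vec, weights, bias, threshold=0.5):
    """Fuse text and code feature vectors by concatenation, then score
    each label independently with a linear layer + sigmoid.

    Multi-label classification differs from multi-class: any subset of
    labels may be active, so each label gets its own threshold decision
    rather than a single softmax/argmax.
    """
    fused = text_vec + code_vec  # list concatenation = naive feature fusion
    scores = [
        sigmoid(sum(w * f for w, f in zip(label_w, fused)) + b)
        for label_w, b in zip(weights, bias)
    ]
    return [s >= threshold for s in scores]

# Toy example: 2 text features, 2 code features, 3 candidate labels.
text_vec = [0.9, 0.1]
code_vec = [0.8, 0.2]
weights = [
    [2.0, 0.0, 2.0, 0.0],    # label 0: fires on strong joint signal
    [-2.0, 0.0, -2.0, 0.0],  # label 1: suppressed by the same signal
    [0.0, 1.0, 0.0, 1.0],    # label 2: only weak evidence available
]
bias = [-1.0, 1.0, -1.0]
print(predict_labels(text_vec, code_vec, weights, bias))
# → [True, False, False]
```

In a neural version, `text_vec` and `code_vec` would come from trained encoders (e.g. an LSTM or Transformer over the statement, and a code encoder over the solution), and the fusion step is exactly where the paper finds only small gains over single-modality models.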
