[Engineering Paper] SCC: Automatic Classification of Code Snippets

Identifying the programming language of a source code file has been studied in the research community, and Machine Learning (ML) and Natural Language Processing (NLP) algorithms have been shown to be effective at this task. However, identifying the programming language of a code snippet, i.e., a few lines of source code, remains challenging. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that identifies the programming language of code snippets written in 21 different programming languages. SCC employs a Multinomial Naive Bayes (MNB) classifier trained on Stack Overflow posts. It achieves an accuracy of 75%, compared with only 55.5% for Programming Languages Identification (PLI), a proprietary online classifier of snippets. The average precision, recall, and F1 score of the proposed tool are 0.76, 0.75, and 0.75, respectively. In addition, SCC can distinguish between code snippets from a family of programming languages such as C, C++, and C#, and can also identify the programming language version, such as C# 3.0, C# 4.0, and C# 5.0.
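To make the Multinomial Naive Bayes idea concrete, the following is a minimal sketch of such a classifier for code snippets. It is not the SCC implementation: the regex tokenizer, the Laplace smoothing constant `alpha`, the class name `SnippetNB`, and the six toy training snippets are all assumptions made for illustration, whereas the tool described above is trained on Stack Overflow posts covering 21 languages.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifiers/keywords, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\S")


def tokenize(snippet):
    """Split a snippet into word-like tokens and single punctuation symbols."""
    return TOKEN_RE.findall(snippet)


class SnippetNB:
    """Toy Multinomial Naive Bayes over token counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumed value)
        self.class_counts = Counter()            # number of snippets per language
        self.token_counts = defaultdict(Counter) # token frequencies per language
        self.vocab = set()

    def fit(self, snippets, labels):
        for snippet, label in zip(snippets, labels):
            tokens = tokenize(snippet)
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, snippet):
        # Tokens never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in tokenize(snippet) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            # log P(label) + sum over tokens of log P(token | label)
            score = math.log(n / n_docs)
            for t in tokens:
                count = self.token_counts[label][t]
                score += math.log((count + self.alpha) / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for this sketch, not taken from the paper).
train = [
    ("def add(a, b):\n    return a + b", "python"),
    ("for x in range(10):\n    print(x)", "python"),
    ("public static void main(String[] args) { System.out.println(1); }", "java"),
    ("private int count = 0; public int getCount() { return count; }", "java"),
    ('#include <stdio.h>\nint main(void) { printf("hi"); return 0; }', "c"),
    ("int *p = malloc(sizeof(int)); free(p);", "c"),
]
model = SnippetNB().fit([s for s, _ in train], [lang for _, lang in train])

print(model.predict("def square(x):\n    return x * x"))
print(model.predict("public void run() { System.out.println(x); }"))
```

With so little training data the sketch only shows the mechanics: each language accumulates token counts, and a snippet is assigned to the language with the highest smoothed log-likelihood. Closely related languages such as C, C++, and C# share most of their tokens, which is part of why short snippets are harder to classify than whole files.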
