Predicting the Programming Language of Questions and Snippets of StackOverflow Using Natural Language Processing

Stack Overflow is the most popular Q&A website among software developers. It serves as a platform for sharing and acquiring knowledge, and the questions posted there usually contain a code snippet. Stack Overflow relies on users to tag the programming language of a question correctly, and it simply assumes that the snippets inside a question are written in the language named by the question's tag. In this paper, we propose a classifier that predicts the programming language of questions posted on Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). By combining features from the title, body, and code snippets of a question, the classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages. We also propose a classifier that uses only the title and body of the question and achieves an accuracy of 81.1%. Finally, we propose a classifier that uses only the code snippets and achieves an accuracy of 77.7%. These results show that applying Machine Learning techniques to the combination of the text and the code snippets of a question gives the best performance. They also demonstrate that it is possible to identify the programming language of a snippet containing only a few lines of source code. We visualize the feature space of two programming languages, Java and SQL, in order to identify special properties of the information contained in the Stack Overflow questions that correspond to these languages.
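As a rough illustration of combining textual and snippet features, the sketch below trains a tiny multinomial Naive Bayes model on tokens drawn from the title, body, and code of a question. It is not the classifier proposed in the paper: the model choice, the tokenizer, the `text:`/`code:` feature prefixes, the helper names (`tokenize`, `featurize`, `predict_language`), and the four toy training questions are all assumptions made for this example, and it covers only Java and SQL.

```python
import math
import re
from collections import Counter, defaultdict

# Words (letters/underscores) or single punctuation symbols; digits are ignored.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    """Split lowercased text into words and individual punctuation symbols."""
    return TOKEN_RE.findall(text.lower())


def featurize(title, body, code):
    """Combine the three parts of a question into one bag of prefixed tokens."""
    feats = ["text:" + t for t in tokenize(title + " " + body)]
    feats += ["code:" + t for t in tokenize(code)]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, samples, labels):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for feats, label in zip(samples, labels):
            self.token_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                if f in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy questions: (title, body, code snippet, language tag).
TRAIN = [
    ("How to iterate a HashMap", "I want to loop over entries",
     "for (Map.Entry<String, Integer> e : map.entrySet()) "
     "{ System.out.println(e.getKey()); }", "java"),
    ("NullPointerException in ArrayList", "my list throws an exception",
     "List<String> xs = new ArrayList<>(); xs.add(null);", "java"),
    ("Join two tables", "need rows from both tables",
     "SELECT a.id, b.name FROM a JOIN b ON a.id = b.a_id;", "sql"),
    ("Count rows per group", "query returns wrong count",
     "SELECT dept, COUNT(*) FROM emp GROUP BY dept;", "sql"),
]

model = NaiveBayes().fit(
    [featurize(t, b, c) for t, b, c, _ in TRAIN],
    [label for *_, label in TRAIN],
)


def predict_language(title, body, code):
    """Predict a language tag from the title, body, and code of a question."""
    return model.predict(featurize(title, body, code))
```

Passing an empty string for `code` gives a text-only variant, and passing empty strings for `title` and `body` gives a snippet-only variant, which loosely mirrors the three settings compared in the abstract.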
