Classification of Programming Problems based on Topic Modeling

Programming is one of the most important and in-demand skills today. Many Online Judge (OJ) systems exist to let learners and programmers practice programming and build problem-solving skills. Most of these OJ systems are used by students and learners on their own. These students and novice programmers sometimes compete against each other and sometimes solve programming problems by themselves in offline mode. However, most OJ systems simply arrange their problems into volumes and contest events, an arrangement that gives no clear indication of a problem's difficulty or category. In this paper, we therefore study reliable techniques for extracting keywords and features that can categorize the programming problems of these OJ systems by type and required skill. We use two popular topic modeling algorithms, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), to extract relevant features. Six classifiers were then trained on these topic modeling features and on naive TF-IDF features. We found that the topic modeling features had relatively low dimensionality, yet classifiers trained on them matched the performance of classifiers trained on the high-dimensional naive TF-IDF features. Our main goal was to understand the trade-off between accuracy and dimensionality for the textual data of programming problem statements. This experiment enabled us to obtain important tags, hints, and classifications for Online Judge programming problems.
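
The feature-extraction step described above can be illustrated with a minimal sketch in scikit-learn. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy problem statements, the two category labels, the choice of two topics, and the use of logistic regression as a stand-in for the six classifiers. The sketch only shows how LDA and NMF yield far fewer feature columns than naive TF-IDF on the same text.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem statements and category labels (not from the paper).
statements = [
    "find the shortest path between two nodes in a weighted graph",
    "count connected components in an undirected graph using traversal",
    "detect a cycle in a directed graph with depth first search",
    "compute the longest common subsequence of two strings",
    "find the longest palindromic substring of a given string",
    "count occurrences of a pattern string inside a text string",
]
labels = ["graph", "graph", "graph", "string", "string", "string"]

N_TOPICS = 2  # assumed number of topics, chosen only for this toy corpus

# Naive TF-IDF features: one column per vocabulary term.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(statements)

# NMF topic features: factorize the TF-IDF matrix into N_TOPICS components.
nmf = NMF(n_components=N_TOPICS, init="nndsvda", random_state=0, max_iter=500)
X_nmf = nmf.fit_transform(X_tfidf)

# LDA topic features: fit on raw term counts, as LDA models word counts.
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(statements)
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
X_lda = lda.fit_transform(X_counts)

# Compare dimensionality and train one stand-in classifier per feature set.
feature_sets = {"tfidf": X_tfidf, "nmf": X_nmf, "lda": X_lda}
dims = {name: X.shape[1] for name, X in feature_sets.items()}
classifiers = {
    name: LogisticRegression(max_iter=1000).fit(X, labels)
    for name, X in feature_sets.items()
}

for name in feature_sets:
    print(f"{name}: {dims[name]} feature columns")
```

On a real corpus, the classifiers would be evaluated on held-out problems so that accuracy can be compared against the number of feature columns; the toy data above is too small for that comparison to be meaningful.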
