Paper information - The Influence of Domain-Based Preprocessing on Subject-Specific Clustering

The sudden move of most university teaching online during the global Covid-19 pandemic has increased the workload of academics. One contributing factor is the high volume of student queries that must be answered. Because these queries are not limited to the synchronous time frame of a lecture, many of them are likely to be related or even equivalent. One way to deal with this problem is to cluster the questions by topic. In our previous work, we aimed to find an improved clustering method that would give us high efficiency, using a recurring LDA model. Our data set contained questions posted online from a Computer Science course at the University of Bath. A significant number of these questions contained code excerpts, which caused problems in clustering: certain terms were treated as common English words rather than recognised as code-specific terms. To address this, we implemented tagging of these technical terms in Python as part of preprocessing the data set. In this paper, we explore the tagging of data sets, focusing on identifying code excerpts, and provide empirical results to justify our reasoning.
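To make the tagging idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that code excerpts can be spotted by two simple heuristics (backtick-quoted text and call-like syntax such as `print(x)`), and it uses a hypothetical `code_` prefix as the tag; the abstract specifies neither the detection rule nor the tag format. Inside a detected excerpt, tokens that are Python keywords or builtin names are prefixed, so that a later topic model would see `code_for` as a different term from the English word "for".

```python
import builtins
import keyword
import re

# Hypothetical tag prefix; the abstract does not specify a tag format.
CODE_TAG = "code_"

# Python keywords plus public builtin names (e.g. "for", "print", "list").
CODE_TERMS = set(keyword.kwlist) | {
    name for name in dir(builtins) if not name.startswith("_")
}

# Assumed heuristics for a code excerpt: backtick-quoted text,
# or a simple call such as foo(...) with no nested parentheses.
CODE_SPAN = re.compile(r"`[^`]+`|\b\w+\([^()]*\)")
WORD = re.compile(r"\b[A-Za-z_]\w*")


def tag_code_terms(text):
    """Prefix keyword/builtin tokens inside code-like spans with CODE_TAG."""

    def tag_word(match):
        word = match.group(0)
        return CODE_TAG + word if word in CODE_TERMS else word

    def tag_span(match):
        # Only words inside the detected excerpt are candidates for tagging.
        return WORD.sub(tag_word, match.group(0))

    return CODE_SPAN.sub(tag_span, text)


if __name__ == "__main__":
    print(tag_code_terms("Why does `for i in range(3)` print nothing for me?"))
```

In this sketch, words outside a detected excerpt are left alone, so "my list is empty" is unchanged even though `list` and `is` are Python terms. A real preprocessing step would likely need a more robust way of locating code than these two patterns.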

James Davenport | Alexandra Gkolia | Nikhil Fernandes | Nicolas Pizzo | Akshar Nair
