Feature Selection Techniques and Classification Accuracy of Supervised Machine Learning in Text Mining

Text mining is a special case of data mining which explore unstructured or semi-structured text documents, to establish valuable patterns and rules that indicate trends and significant features about specific topics. Text mining has been in pattern recognition, predictive studies, sentiment analysis and statistical theories in many areas of research, medicine, financial analysis, social life analysis, and business intelligence. Text mining uses concept of natural language processing and machine learning. Machine learning algorithms have been used and reported to give great results, but their performance of machine learning algorithms is affected by factors such as dataset domain, number of classes, length of the corpus, and feature selection techniques used. Redundant attribute affects the performance of the classification algorithm, but this can be reduced by using different feature selection techniques and dimensionality reduction techniques.  Feature selection is a data preprocessing step that chooses a subset of input variable while eliminating features with little or no predictive information. Feature selection techniques are Information gain, Term Frequency, Term Frequency-Inverse document frequency, Mutual Information, and Chi-Square, which can use a filters, wrappers, or embedded approaches. To get the most value from machine learning, pairing the best algorithms with the right tools and processes is necessary. Little research has been done on the effect of feature selection techniques on classification accuracy for pairing of these algorithms with the best feature selection techniques for optimal results. In this research, a text classification experiment was conducted using incident management dataset, where incidents were classified into their resolver groups. Support vector machine (SVM), K-Nearest Neighbors (KNN), Naive Bayes (NB) and Decision tree (DT) machine learning algorithms were examined. Filtering approach was used on the feature selection techniques, with different ranking indices applied for optimal feature set and classification accuracy results analyzed. The classification accuracy results obtained using TF were, 88% for SVM, 70% for NB, 79% for Decision tree, and KNN had 55%, while Boolean registered 90%, 83%, 82% and 75%, for SVM, NB, DT, and KNN respectively. TF-IDF, had 91%, 83%, 76%, and 56% for SVM, NB, DT, and KNN respectively. The results showed that algorithm performance is affected by feature selection technique applied. SVM performed best, followed by DT, KNN and finally NB. In conclusion, presence of noisy data leads to poor learning performance and increases the computational time. The classifiers performed differently depending on the feature selection technique applied. For optimal results, the classifier that performed best together with the feature selection technique with the best feature subset should be applied for all types of data for accurate classification performance. Keywords: Text Classification, Supervised Machine Learning, Feature Selection DOI : 10.7176/JIEA/9-3-06 Publication date :May 31 st 2019