Unsupervised feature selection technique based on genetic algorithm for improving the Text Clustering

The increasing number of text documents in digital form affects text analysis techniques. Text clustering (TC) is an important technique for organizing a massive collection of text documents into clusters. The main problem affecting text clustering is the presence of sparse and uninformative features in the text documents. Unsupervised feature selection (FS) addresses this problem by selecting the informative features, which improves the performance of the text clustering algorithm. Recently, meta-heuristic algorithms have been applied successfully to several hard optimization problems. In this paper, we propose a genetic algorithm (GA) for the unsupervised feature selection problem, called FSGATC. The method creates a new subset of informative features in order to obtain more accurate clusters. Experiments were conducted on four benchmark text datasets with varying characteristics. The results show that FSGATC improves the performance of the text clustering algorithm and outperforms standalone k-means clustering. The proposed method was evaluated using F-measure and accuracy, which are common measures in the text clustering domain.
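
The idea of a genetic algorithm searching over feature subsets can be sketched as follows. This is only an illustrative sketch, not the FSGATC implementation: the abstract does not state the fitness function, operators, or parameter values, so everything here is an assumption. In particular, the names `feature_scores`, `fitness`, and `select_features` are hypothetical, the fitness is a stand-in that rewards features whose mean absolute difference exceeds the average over all features, and the population size, rates, single-point crossover, bit-flip mutation, and elitism are arbitrary choices. Each chromosome is a binary mask over the columns of a document-term weight matrix.

```python
import random


def feature_scores(matrix):
    """Mean absolute difference of each column from its column mean."""
    n_docs = len(matrix)
    scores = []
    for col in zip(*matrix):
        mean = sum(col) / n_docs
        scores.append(sum(abs(v - mean) for v in col) / n_docs)
    return scores


def fitness(mask, scores, baseline):
    """Stand-in fitness (an assumption, not the paper's): sum of
    (score - baseline) over the features the mask keeps."""
    return sum(s - baseline for s, keep in zip(scores, mask) if keep)


def select_features(matrix, pop_size=20, generations=40,
                    crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Return a 0/1 mask over the columns of `matrix` found by a simple GA."""
    rng = random.Random(seed)
    scores = feature_scores(matrix)
    baseline = sum(scores) / len(scores)
    n = len(scores)

    def fit(mask):
        return fitness(mask, scores, baseline)

    # Start with the full feature set plus random masks.
    population = [[1] * n]
    while len(population) < pop_size:
        population.append([rng.randint(0, 1) for _ in range(n)])

    def tournament():
        # Binary tournament: keep the fitter of two random individuals.
        a, b = rng.sample(population, 2)
        return a if fit(a) >= fit(b) else b

    best = max(population, key=fit)
    for _ in range(generations):
        next_pop = [best[:]]  # elitism: carry the best mask forward
        while len(next_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            child = p1[:]
            if n > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation on each gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=fit)
    return best


if __name__ == "__main__":
    # Toy weights: 6 documents, 4 terms; terms 1 and 3 barely vary.
    docs = [
        [0.9, 0.1, 0.0, 0.1],
        [0.8, 0.1, 0.1, 0.1],
        [0.9, 0.1, 0.0, 0.1],
        [0.0, 0.1, 0.9, 0.1],
        [0.1, 0.1, 0.8, 0.1],
        [0.0, 0.1, 0.9, 0.1],
    ]
    mask = select_features(docs)
    reduced = [[v for v, keep in zip(row, mask) if keep] for row in docs]
    print(mask, reduced[0])
```

In a pipeline like the one the abstract describes, the reduced matrix would then be passed to the clustering step (k-means in the paper), and the resulting clusters scored with F-measure and accuracy.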
