Text Document Clustering Using Memetic Feature Selection

With the rapid growth in the volume of electronic documents, more sophisticated machine learning methods are needed to manage them. In this paper, a memetic feature selection technique is proposed to improve the k-means and spherical k-means clustering algorithms. The proposed technique combines a wrapper inductive method with a filter ranking method. Internal and external clustering evaluation measures are used to assess the resulting clusters. The test results show that, with the proposed hybrid method, the resulting clusters are more accurate and more compact than the clusters obtained using features selected by a genetic algorithm (GA) alone or using the entire feature space.
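The general idea of combining a wrapper search with a filter ranking can be illustrated with a minimal sketch. It is not the paper's implementation: the variance-based filter score, the explained-variance fitness, the GA operators, and all function names and parameters (`filter_scores`, `wrapper_fitness`, `local_search`, `memetic_select`, `pop_size`, `generations`) are assumptions chosen for illustration, and the toy data are synthetic.

```python
import random

def filter_scores(docs):
    """Filter ranking (assumed): score each term by its variance across documents."""
    n = len(docs)
    scores = []
    for col in zip(*docs):
        mean = sum(col) / n
        scores.append(sum((v - mean) ** 2 for v in col) / n)
    return scores

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def wrapper_fitness(docs, mask, k, iters=10):
    """Wrapper score (assumed): fraction of variance explained by k-means
    clusters built on the selected terms only; higher is better."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    pts = [[d[j] for j in cols] for d in docs]
    # Fixed-seed initialisation so the same mask always gets the same score.
    centers = [list(p) for p in random.Random(0).sample(pts, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        for i, g in enumerate(groups):
            if g:
                centers[i] = [sum(c) / len(g) for c in zip(*g)]
    within = sum(min(sq_dist(p, c) for c in centers) for p in pts)
    mean = [sum(c) / len(pts) for c in zip(*pts)]
    total = sum(sq_dist(p, mean) for p in pts)
    return max(0.0, 1.0 - within / total) if total > 0 else 0.0

def local_search(docs, mask, k, ranking):
    """Memetic step: try adding the best-ranked unselected term and dropping
    the worst-ranked selected term; keep a move only if fitness improves."""
    best, best_fit = mask, wrapper_fitness(docs, mask, k)
    on = [j for j in ranking if mask[j]]        # best-ranked first
    off = [j for j in ranking if not mask[j]]
    for j in ([off[0]] if off else []) + ([on[-1]] if on else []):
        trial = best[:]
        trial[j] = 1 - trial[j]
        fit = wrapper_fitness(docs, trial, k)
        if fit > best_fit:
            best, best_fit = trial, fit
    return best, best_fit

def memetic_select(docs, k, pop_size=8, generations=10, seed=0):
    """GA over term masks, with the filter-guided local search applied to each child."""
    rng = random.Random(seed)
    m = len(docs[0])
    scores = filter_scores(docs)
    ranking = sorted(range(m), key=lambda j: -scores[j])
    pop = [[1] * m] + [[rng.randint(0, 1) for _ in range(m)]
                       for _ in range(pop_size - 1)]
    scored = [local_search(docs, ind, k, ranking) for ind in pop]
    for _ in range(generations):
        scored.sort(key=lambda s: s[1], reverse=True)
        children = [scored[0]]                                   # elitism
        while len(children) < pop_size:
            a = max(rng.sample(scored, 2), key=lambda s: s[1])[0]  # tournament
            b = max(rng.sample(scored, 2), key=lambda s: s[1])[0]
            child = [rng.choice(pair) for pair in zip(a, b)]       # uniform crossover
            child = [1 - g if rng.random() < 1.0 / m else g for g in child]  # mutation
            children.append(local_search(docs, child, k, ranking))
        scored = children
    return max(scored, key=lambda s: s[1])

# Synthetic toy data: terms 0 and 1 separate two groups, terms 2..5 are noise.
data_rng = random.Random(1)
docs = [[(i % 2) * 5 + data_rng.random(), (i % 2) * 5 + data_rng.random()]
        + [data_rng.random() * 5 for _ in range(4)] for i in range(20)]
mask, fit = memetic_select(docs, k=2)
print("selected terms:", [j for j, bit in enumerate(mask) if bit], "fitness:", fit)
```

In this sketch the filter ranking only steers the local moves, while the clustering-based fitness decides whether a move is kept. That division of labour is one plausible reading of "combining the wrapper method with the filter ranking method". A spherical k-means variant would replace the squared Euclidean distance with cosine similarity on length-normalised vectors.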
