Genetic Algorithm and Confusion Matrix for Document Clustering

Text mining is one of the most important tools in Information Retrieval. Text clustering is the process of grouping documents into categories according to their content. Existing supervised learning algorithms for automatic text classification require a sufficient number of documents to learn accurately. This paper presents a niching memetic algorithm and a genetic algorithm (GA) in which feature selection is an integral part of the global clustering search procedure. The procedure attempts to avoid getting trapped in less promising local optima in both clustering and feature selection. The confusion matrix is then used for derivative works, and finally a hybrid GA is applied for the final classification. Experimental results show the benefits of the proposed method, which is evaluated by F-measure and purity and yields better performance in terms of false positives, false negatives, true positives and true negatives.
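The evaluation measures named above can be illustrated with a minimal sketch. The abstract does not state the exact formulas used, so the code assumes the standard textbook definitions: F-measure as the harmonic mean of precision and recall, and purity as the fraction of documents that belong to the majority class of their cluster. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (assumed standard F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def purity(labels, clusters):
    """Fraction of documents that match the majority class of their cluster."""
    by_cluster = {}
    for label, cluster in zip(labels, clusters):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(labels)


# Toy data (hypothetical): six documents, two topics, two clusters.
y_true = ["sport", "sport", "tech", "tech", "sport", "tech"]
y_pred = ["sport", "tech", "tech", "tech", "sport", "sport"]
cluster_ids = [0, 1, 1, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="sport")
f1 = f_measure(tp, fp, fn)
pur = purity(y_true, cluster_ids)
```

For a multi-class problem, the same counts would be computed once per class and the per-class F-measures averaged; which averaging scheme the paper uses is not stated in the abstract.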
