Text feature extraction based on joint conditional entropy

Extracting features from data is an important task in data mining and summarization. Text feature extraction aims to extract useful information from texts by identifying and exploring patterns of interest. We propose a feature extraction strategy based on joint conditional entropy and a genetic algorithm. Joint conditional entropy measures the uncertainty of a set of variables under given conditions; here it is used to obtain the feature words that represent texts. Genetic algorithms have been applied successfully in many fields and are useful for solving optimization and search problems. In this paper, we first preprocess the texts to obtain their words, and then present the joint conditional entropy, which is used to define the fitness function of the genetic algorithm for discovering suitable words to represent the texts. Finally, experimental results show that this approach is suitable for extracting ideal text features.
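
As an illustration of the quantity involved, the sketch below computes the standard information-theoretic joint conditional entropy H(X_1, ..., X_k | Y) = -sum p(x, y) log2 p(x | y) for a candidate set of feature words. It is a minimal sketch under stated assumptions, not the method of the paper: each X_i is assumed to be a binary word-presence indicator, the condition Y is assumed to be a document class label, and the names `joint_conditional_entropy` and `decode` are hypothetical. The abstract does not state how the entropy enters the fitness function, so no fitness formula is given here.

```python
import math
from collections import Counter


def joint_conditional_entropy(docs, labels, words):
    """Return H(X_w1, ..., X_wk | Y) in bits.

    Assumption: X_w is 1 if word w occurs in a document, else 0,
    and Y is the document's class label.
    docs   -- list of sets of words (already preprocessed)
    labels -- list of class labels, one per document
    words  -- candidate feature words
    """
    n = len(docs)
    joint = Counter()  # counts of (label, presence pattern)
    cond = Counter()   # counts of label alone
    for doc, y in zip(docs, labels):
        pattern = tuple(w in doc for w in words)
        joint[(y, pattern)] += 1
        cond[y] += 1
    h = 0.0
    for (y, _), count in joint.items():
        # p(x, y) = count / n and p(x | y) = count / cond[y]
        h -= (count / n) * math.log2(count / cond[y])
    return h


def decode(mask, vocabulary):
    """Map a bit-mask chromosome (hypothetical encoding) to its selected words."""
    return [w for bit, w in zip(mask, vocabulary) if bit]


# Toy corpus, invented purely for illustration.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "tax"}]
labels = ["sport", "sport", "politics", "politics"]
vocabulary = ["ball", "goal", "tax"]

h_ball = joint_conditional_entropy(docs, labels, decode([1, 0, 0], vocabulary))
h_goal = joint_conditional_entropy(docs, labels, decode([0, 1, 0], vocabulary))
h_pair = joint_conditional_entropy(docs, labels, decode([0, 1, 1], vocabulary))
print(h_ball, h_goal, h_pair)
```

In a genetic algorithm of the kind described, a chromosome such as the bit mask passed to `decode` would select a subset of words, and a fitness function would score that subset using the entropy value. Whether the entropy is minimized, maximized, or combined with other terms is left open here because the abstract does not specify it.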
