Survey on Feature Selection in Document Clustering

Text mining studies techniques for discovering useful knowledge in very large document collections and for building systems that deliver that knowledge and support decision making. A cluster is a group of similar data items, and document clustering partitions a collection into groups of similar documents. Clustering is a fundamental data analysis technique used in fields such as biology, psychology, control and signal processing, information theory, and mining technologies. Text mining is not a stand-alone task that human analysts typically engage in. Its goal is to transform text written in everyday language into a structured, database-like format, so that heterogeneous documents are summarized and presented in a uniform manner. Among the main challenges of text clustering are large volume, high dimensionality, and complex semantics.
