Text clustering using statistical and semantic data

The explosive growth of information stored in unstructured texts has created a great demand for new and powerful tools, such as text mining, for acquiring useful information. Document clustering is one of its most powerful methods, by which document retrieval, organization and summarization can be achieved. However, clustering becomes challenging on large document collections because of the high dimensionality of the feature space and the semantic correlation between features. In this paper, we propose a new sequential document clustering algorithm that uses statistical and semantic feature selection methods. The semantic process is introduced to complement the frequency-based mechanism with the semantic relations of the text documents. The proposed algorithm iteratively selects relevant features and performs clustering until convergence. To evaluate its performance, experiments were conducted on two corpora. The results show that our algorithm outperforms the existing algorithms.
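
The loop of selecting features and re-clustering until convergence can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: the abstract does not name the statistic, the clustering method or the semantic step, so this sketch assumes a chi-square score as the statistical criterion, spherical k-means with evenly spaced seed documents as the clusterer, and omits the semantic step entirely. All function names and parameters (`select_features`, `n_features`, `max_rounds`, and so on) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def vectorize(tokenized, vocab):
    # L2-normalised term-frequency vectors restricted to the selected vocabulary.
    vecs = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        vecs.append({t: c / norm for t, c in counts.items()})
    return vecs

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def mean_vectors(vecs, labels, k):
    # Normalised sum of the member vectors of each cluster.
    sums = [defaultdict(float) for _ in range(k)]
    for v, c in zip(vecs, labels):
        for t, w in v.items():
            sums[c][t] += w
    centroids = []
    for s in sums:
        norm = math.sqrt(sum(w * w for w in s.values())) or 1.0
        centroids.append({t: w / norm for t, w in s.items()})
    return centroids

def kmeans(vecs, k, labels=None, max_iter=20):
    # Assumption: spherical k-means, seeded with evenly spaced documents
    # on the first call and with the previous labels afterwards.
    if labels is None:
        centroids = [dict(vecs[i * len(vecs) // k]) for i in range(k)]
    else:
        centroids = mean_vectors(vecs, labels, k)
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in vecs]
        if new_labels == labels:
            break
        labels = new_labels
        centroids = mean_vectors(vecs, labels, k)
    return labels

def select_features(tokenized, labels, k, n_features):
    # Assumption: rank terms by their best chi-square score against any cluster.
    n = len(tokenized)
    doc_sets = [set(tokens) for tokens in tokenized]
    cluster_size = Counter(labels)
    df = Counter(t for s in doc_sets for t in s)
    df_in = Counter((t, c) for s, c in zip(doc_sets, labels) for t in s)
    scores = {}
    for t in df:
        best = 0.0
        for c in range(k):
            a = df_in[(t, c)]          # in cluster, contains term
            b = df[t] - a              # outside cluster, contains term
            m = cluster_size[c] - a    # in cluster, lacks term
            d = n - a - b - m          # outside cluster, lacks term
            denom = (a + b) * (m + d) * (a + m) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * m) ** 2 / denom)
        scores[t] = best
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:n_features])

def cluster_with_feature_selection(docs, k, n_features, max_rounds=10):
    # Alternate clustering and feature selection until neither changes.
    tokenized = [d.lower().split() for d in docs]
    vocab = {t for tokens in tokenized for t in tokens}
    labels = None
    for _ in range(max_rounds):
        new_labels = kmeans(vectorize(tokenized, vocab), k, labels)
        new_vocab = select_features(tokenized, new_labels, k, n_features)
        if new_labels == labels and new_vocab == vocab:
            break
        labels, vocab = new_labels, new_vocab
    return labels, vocab

docs = [
    "cat dog pet animal", "dog pet animal food", "cat pet food animal",
    "stock market trade price", "market price stock bank", "trade bank stock price",
]
labels, vocab = cluster_with_feature_selection(docs, k=2, n_features=4)
print(labels, sorted(vocab))
```

On this toy corpus the loop stops once a round leaves both the cluster labels and the selected vocabulary unchanged. A semantic step, for example merging related terms before scoring, would slot in ahead of `select_features`, but its form here would be a guess.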
