Paper Information - Text Document Preprocessing and Dimension Reduction Techniques for Text Document Clustering

Text mining generally refers to the process of extracting interesting (non-trivial) features and knowledge from unstructured text documents. It is an interdisciplinary field that draws on information retrieval, data mining, machine learning, parametric statistics and computational linguistics. Standard text mining and information retrieval techniques for text documents usually rely on similar categories. An alternative way of retrieving information is to cluster the documents after preprocessing the text. The preprocessing steps strongly affect how successfully knowledge can be extracted. This study implements TF-IDF weighting and singular value decomposition (SVD) as dimensionality reduction techniques. The proposed system combines effective preprocessing with dimensionality reduction to support document clustering with the k-means algorithm. Experimental results on the BBC News and BBC Sport datasets show that the proposed method improves the performance of English text document clustering.
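The pipeline named in the abstract can be illustrated with a minimal, dependency-free sketch. It covers only the TF-IDF weighting and k-means steps; the SVD reduction step is omitted here to keep the example short. The stop-word list, the IDF variant, the four sample sentences and the fixed initial centroids are all illustrative assumptions, not the authors' settings or data.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "is", "on", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # One common IDF variant, log(N / df) + 1 (assumed; the paper may differ).
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return vocab, rows


def kmeans(rows, init_indices, iterations=10):
    """Plain k-means with squared Euclidean distance and fixed initial centroids."""
    centroids = [list(rows[i]) for i in init_indices]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest centroid.
        for i, row in enumerate(rows):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [rows[i] for i in range(len(rows)) if labels[i] == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical mini-corpus: two sport sentences and two business sentences.
docs = [
    "The striker scored a goal in the football match",
    "The football team won the match with a late goal",
    "The bank raised interest rates on loans",
    "Interest rates and loans worry the bank",
]
vocab, rows = tfidf_matrix(docs)
labels = kmeans(rows, init_indices=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

The initial centroids are fixed to documents 0 and 2 so that the run is reproducible; a practical setup would more likely use random or k-means++ initialization and apply a truncated SVD to the TF-IDF matrix before clustering.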

Ammar Ismael Kadhim | Yu-N Cheah | Nurul Hashimah Ahamed | Y. Cheah
