An Improved Similarity Matching based Clustering Framework for Short and Sentence Level Text

Text clustering plays a key role in navigation and browsing. Efficient text clustering groups a large amount of information into meaningful clusters. Many existing text clustering techniques do not address issues such as high time and space complexity, inability to capture the relational and contextual attributes of words, limited robustness, and risks of privacy exposure. To address these issues, an efficient text-based clustering framework is proposed. The Reuters dataset is used as the input dataset. Once the input dataset is preprocessed, the similarity between words is computed using cosine similarity. The similarities between the components are compared, and the vector data is created. From the vector data, the clustering particle is computed. To optimize the clustering results, mutation is applied to the vector data. The performance of the proposed framework is analyzed using Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and processing time. The experimental results show that the proposed framework achieves better MSE, PSNR, and processing time than the existing Fuzzy C-Means (FCM) and Pairwise Random Swap (PRS) methods.
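
The cosine similarity step mentioned above can be illustrated with a minimal sketch. The abstract does not specify the vector representation or the preprocessing, so this example assumes plain term-frequency vectors built by lowercasing and splitting on whitespace; the function name `cosine_similarity` and the sample sentences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between term-frequency vectors of two short texts.

    Assumption: tokens are obtained by lowercasing and whitespace splitting;
    the paper's actual preprocessing is not described in the abstract.
    """
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())

    # Dot product over the terms the two texts share.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())

    # Euclidean norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))

    # An empty text has no direction, so its similarity is defined as 0 here.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical sentence-level inputs.
    print(cosine_similarity("oil prices rise", "oil prices fall"))
    print(cosine_similarity("oil prices rise", "wheat exports drop"))
```

A score near 1 indicates that two sentences use nearly the same terms in the same proportions, and a score of 0 indicates no shared terms. A clustering step could then group sentences whose pairwise scores are high, although how the framework turns these scores into vector data and clustering particles is not detailed in the abstract.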
