The Exploration and Application of K-medoids in Text Clustering

Cluster analysis is a statistical method for grouping samples or indexes. Traditional text clustering algorithms are complicated and inconvenient for data processing. We therefore propose a new text clustering algorithm based on K-medoids that combines document category with semantic contribution. The algorithm not only optimizes the document frequency but also takes into account the influence of document category on feature weight. It proceeds as follows: first, the proposed semantic contribution is combined with fuzzy clustering to assign categories to documents that have no category information; then, the proposed category information entropy is combined with the semantic contribution to modify the traditional TF-IDF weight calculation method. Tested on the Chinese text categorization corpus of an open platform, the new algorithm was superior to the traditional weight calculation method. We therefore conclude that the new text clustering algorithm may have broad application prospects.

To address the shortcomings of the traditional weight calculation method for feature items, a text clustering algorithm based on K-medoids is proposed. The term frequency and inverse document frequency are improved, and the influence of document category on feature weight is further studied. Because standard classification datasets may not be available in practice, a new weight calculation method combining category and semantic contribution is also proposed. First, the semantic contribution is defined and combined with fuzzy clustering, so that a rough clustering of a text set without category information yields a text set with category information. Then, the category information entropy is defined and combined with the semantic contribution to improve the traditional TF-IDF weight calculation method, giving a more effective weight calculation method. The Chinese text categorization corpus from the open platform for Chinese natural language processing of Fudan University was used for testing. The results showed that the new method for weight calculation of feature items was superior to the traditional weight calculation method. It is concluded that the improved text clustering algorithm can be used in a wider range of settings.
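The abstract does not give the formulas for the semantic contribution or the category information entropy, so the following Python sketch only illustrates the general shape of such a pipeline under stated assumptions. It computes the traditional TF-IDF weight, a per-term entropy over category labels (one plausible reading of "category information entropy", not the paper's definition), a hypothetical damping of TF-IDF by that entropy, and a simple alternating K-medoids over cosine distance. All function names, the `1 / (1 + entropy)` adjustment, and the toy documents are assumptions made for illustration; the semantic contribution and the fuzzy rough clustering step are not modeled, and the category labels are simply supplied by hand.

```python
import math
import random
from collections import Counter


def tf_idf(docs):
    """Traditional TF-IDF for tokenized docs: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def category_entropy(docs, labels):
    """Entropy of each term's occurrences over categories (assumed definition)."""
    per_term = {}
    for doc, label in zip(docs, labels):
        for term in doc:
            per_term.setdefault(term, Counter())[label] += 1
    entropy = {}
    for term, counts in per_term.items():
        total = sum(counts.values())
        entropy[term] = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy


def entropy_adjusted(weights, entropy):
    """Hypothetical adjustment: damp terms that are spread across categories."""
    return [{t: w / (1.0 + entropy[t]) for t, w in doc.items()} for doc in weights]


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 if na == 0.0 or nb == 0.0 else 1.0 - dot / (na * nb)


def k_medoids(vectors, k, init=None, max_iter=100, seed=0):
    """Alternating K-medoids: assign to nearest medoid, then re-pick medoids."""
    n = len(vectors)
    dist = [
        [0.0 if i == j else cosine_distance(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]
    medoids = list(init) if init is not None else random.Random(seed).sample(range(n), k)

    def assign_all(meds):
        # A medoid always belongs to its own cluster, so no cluster is empty.
        return [i if i in meds else min(meds, key=lambda m: dist[i][m]) for i in range(n)]

    for _ in range(max_iter):
        assign = assign_all(medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assign[i] == m]
            # The new medoid is the member with the smallest total distance.
            new_medoids.append(min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign_all(medoids)


# Toy corpus with hand-supplied category labels (illustrative only).
docs = [
    ["cat", "dog", "pet"], ["dog", "pet", "vet"], ["cat", "pet", "dog"],
    ["stock", "market", "trade"], ["market", "trade", "bank"], ["stock", "bank", "market"],
]
labels = ["animal"] * 3 + ["finance"] * 3
weights = entropy_adjusted(tf_idf(docs), category_entropy(docs, labels))
medoids, assign = k_medoids(weights, k=2, init=[0, 3])
print(medoids, assign)
```

The initial medoids are passed explicitly here so the toy run is deterministic; with `init=None` the sketch draws them at random, and an alternating scheme like this one can then settle in a local optimum. Unlike K-means, the cluster centers are always actual documents, which is why only a pairwise distance matrix is needed.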
