Novel text clustering approach based on R-Grams

Traditional text clustering algorithms struggle to balance clustering accuracy and recall. To address this, a clustering approach based on the R-Grams text similarity computing algorithm is proposed. First, the documents to be clustered are sorted in descending order; second, the symbolic documents are identified and initial clustering results are obtained using an R-Grams-based similarity computing algorithm; finally, the final clustering results are obtained by combining the initial clusters. Experimental results show that the proposed approach can flexibly regulate the clustering results by adjusting the clustering threshold parameter to meet different demands, and that the optimal threshold value is about 15. As the clustering threshold increases, clustering accuracy increases, while recall first increases and then decreases. In addition, the approach requires no time-consuming processing steps such as word segmentation and feature extraction, and it is easy to implement.
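
The sort-then-threshold procedure described above can be illustrated with a minimal sketch. This is not the R-Grams algorithm itself, which the abstract does not define. The sketch makes several assumptions: plain character trigrams stand in for R-Grams, "descending order" is taken to mean longest document first, the first member of each cluster acts as its symbolic document, and the threshold is a count of shared n-grams. The final step of combining initial clusters is omitted. The names char_ngrams, shared_ngrams, and cluster are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the multiset of character n-grams of text.

    Assumption: character n-grams stand in for R-Grams. They need no
    word segmentation or feature extraction.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def shared_ngrams(a, b):
    """Count the n-grams common to two multisets (multiset intersection)."""
    return sum((a & b).values())


def cluster(docs, threshold, n=3):
    """Greedy threshold clustering; returns lists of document indices.

    Assumption: documents are processed longest first, and the first
    member of each cluster serves as its representative.
    """
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]), reverse=True)
    profiles = {i: char_ngrams(docs[i], n) for i in order}
    clusters = []
    for i in order:
        for members in clusters:
            # Join the first cluster whose representative is similar enough.
            if shared_ngrams(profiles[i], profiles[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No representative matched: this document starts a new cluster.
            clusters.append([i])
    return clusters


docs = [
    "the quick brown fox jumps",
    "the quick brown fox leaps",
    "completely different text here",
]
print(cluster(docs, threshold=10))
```

In this sketch, raising the threshold makes it harder for a document to join an existing cluster, so clusters become smaller and more numerous. This is the kind of tuning knob the abstract describes, though the reported optimum of about 15 applies to the authors' own similarity measure and not necessarily to this stand-in.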