Semantic Correlation Network Based Text Clustering

Text documents occupy sparse data spaces, so a document's nearest neighbors may belong to different classes when existing proximity measures are used to describe the correlation between documents. In this paper, we propose an asymmetric similarity measure that strengthens the discriminative features of document objects. We construct a semantic correlation network from the asymmetric similarities between documents and conjecture that its connection distribution follows a power law. The hub points of the semantic correlation network are then clustered by an agglomerative hierarchical clustering approach named SCN, in which hub-point proximity is defined by both object similarity and neighbor similarity. Finally, we assign the remaining text objects to their nearest hub points. Experimental evaluation on textual data sets demonstrates the validity and efficiency of SCN, and a comparison with other clustering algorithms shows the superiority of our approach.
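
The pipeline outlined above can be illustrated with a minimal Python sketch. It is not the SCN implementation: the abstract does not specify the asymmetric measure, the hub criterion, or the proximity formula, so everything concrete here is an assumption made for illustration. In particular, `asym_sim` uses term-set containment as a stand-in asymmetric similarity, hubs are taken to be the nodes with the highest in-degree, hub proximity is an equal-weight blend (`alpha`) of object similarity and neighbor-set overlap, and the merging uses average linkage. All function names and parameters (`threshold`, `num_hubs`, `alpha`) are hypothetical.

```python
from itertools import combinations

def asym_sim(a, b):
    """Assumed measure: share of a's terms found in b (not symmetric in general)."""
    return len(a & b) / len(a) if a else 0.0

def build_network(docs, threshold):
    """Directed edge i -> j wherever asym_sim(docs[i], docs[j]) >= threshold."""
    n = len(docs)
    return {
        i: {j for j in range(n)
            if j != i and asym_sim(docs[i], docs[j]) >= threshold}
        for i in range(n)
    }

def pick_hubs(out_edges, num_hubs):
    """Assumed hub criterion: the num_hubs nodes with the highest in-degree."""
    indeg = {i: 0 for i in out_edges}
    for targets in out_edges.values():
        for j in targets:
            indeg[j] += 1
    return sorted(indeg, key=lambda i: (-indeg[i], i))[:num_hubs]

def hub_proximity(i, j, docs, out_edges, alpha=0.5):
    """Blend object similarity with the Jaccard overlap of neighbor sets."""
    obj = 0.5 * (asym_sim(docs[i], docs[j]) + asym_sim(docs[j], docs[i]))
    ni, nj = out_edges[i] | {i}, out_edges[j] | {j}
    nbr = len(ni & nj) / len(ni | nj)
    return alpha * obj + (1 - alpha) * nbr

def cluster(docs, k, threshold=0.5, num_hubs=4):
    """Cluster hubs agglomeratively, then attach the other documents."""
    out_edges = build_network(docs, threshold)
    hubs = pick_hubs(out_edges, num_hubs)
    groups = [[h] for h in hubs]

    def linkage(pair):
        # Average pairwise hub proximity between two groups.
        a, b = pair
        total = sum(hub_proximity(i, j, docs, out_edges)
                    for i in groups[a] for j in groups[b])
        return total / (len(groups[a]) * len(groups[b]))

    # Merge the two closest groups until only k remain.
    while len(groups) > k:
        a, b = max(combinations(range(len(groups)), 2), key=linkage)
        groups[a] += groups.pop(b)  # a < b, so index a stays valid

    labels = {h: g for g, members in enumerate(groups) for h in members}
    # Each remaining document joins the cluster of its most similar hub.
    for i in range(len(docs)):
        if i not in labels:
            nearest = max(hubs, key=lambda h: asym_sim(docs[i], docs[h]))
            labels[i] = labels[nearest]
    return [labels[i] for i in range(len(docs))]

docs = [
    {"neural", "network", "training", "gradient"},
    {"neural", "network", "gradient"},
    {"neural", "training"},
    {"stock", "market", "price", "trading"},
    {"stock", "market", "trading"},
    {"market", "price"},
]
labels = cluster(docs, k=2)
```

On this toy corpus the first three documents end up in one cluster and the last three in the other. The containment measure makes the asymmetry visible: a short document is fully contained in a longer one on the same topic, while the reverse similarity is lower.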