A Method for Massive Scientific Literature Clustering Based on Hadoop

With the development of science and technology and the emergence of a large amount of new, specialized vocabulary, the traditional classification of disciplines can no longer meet the current need for subject division of scientific literature. At the same time, clustering scientific literature at this scale places higher demands on the efficiency of the methods and on the supporting software and hardware. In this paper, text features are extracted using the TF-IDF method combined with the characteristics of scientific literature. In a Hadoop distributed environment, text clustering is then carried out with the Canopy-Kmeans algorithm, which makes it possible to cluster massive scientific literature. Compared with previous algorithms, the proposed method improves key indicators and greatly improves clustering efficiency.
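
As an illustration only, the pipeline summarized above (TF-IDF features, Canopy selection of initial centers, then K-means refinement) can be sketched on a single machine in plain Python. This is not the paper's Hadoop implementation: the toy documents, the function names, the cosine distance, and the threshold value `t2 = 0.8` are all assumptions made for the sketch, and the Canopy step is simplified to use only the tight threshold, since the loose threshold affects canopy membership rather than the choice of centers.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either vector is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def canopy_centers(vectors, t2):
    """Simplified Canopy pass: take the first remaining point as a center,
    then drop every point within the tight threshold t2 of that center."""
    remaining = list(range(len(vectors)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [i for i in remaining
                     if i != c and cosine_distance(vectors[c], vectors[i]) > t2]
    return centers

def mean_vector(vecs):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for v in vecs:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vecs) for t, w in total.items()}

def kmeans(vectors, centroids, iterations=10):
    """Plain K-means with cosine distance, starting from the given centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every document.
        labels = [min(range(len(centroids)),
                      key=lambda k: cosine_distance(v, centroids[k]))
                  for v in vectors]
        # Update step: recompute each non-empty cluster's centroid.
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels

# Hypothetical toy corpus: two computing abstracts and two biology abstracts.
docs = [
    "hadoop mapreduce cluster distributed".split(),
    "hadoop distributed storage mapreduce".split(),
    "protein gene sequence expression".split(),
    "gene expression protein cell".split(),
]
vectors = tfidf(docs)
centers = canopy_centers(vectors, t2=0.8)  # threshold chosen for this toy data
labels = kmeans(vectors, [dict(vectors[i]) for i in centers])
print(centers, labels)  # [0, 2] [0, 0, 1, 1]
```

The point of the Canopy step in this sketch is that the number of clusters and the initial centroids come from a single cheap pass over the data rather than from a preset k with random seeds. In a distributed setting, the distance computations in both steps would be partitioned across workers, which is not shown here.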