Design and Application of a Text Clustering Algorithm Based on Parallelized K-Means Clustering

Received: 1 May 2019  Accepted: 8 August 2019

Traditional text clustering algorithms face two common problems: the high dimensionality of the computed vectors and poor computational efficiency. To solve these problems, this paper examines K-means clustering (KMC) together with the Hadoop and Spark big data techniques, and proposes a novel text clustering algorithm based on KMC parallelized on a big data platform. The proposed algorithm is denoted SWCK-means. First, Word2vec was adopted to calculate the weights of word vectors, thereby reducing the dimensionality of the massive text data. Next, the Canopy algorithm was introduced to cluster the weight data and identify the initial cluster centers for the KMC. On this basis, the KMC was employed to cluster the preprocessed data. To improve efficiency, a parallel design of the Canopy algorithm and the KMC was developed under the Spark architecture. The proposed algorithm was verified through experiments on a massive amount of online text data. The results show that our algorithm achieves more accurate clustering than the traditional KMC, especially when handling huge amounts of data.
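The core idea of seeding K-means with Canopy clusters can be sketched as follows. This is a minimal single-machine illustration, not the paper's Spark implementation: the function names, the thresholds `t1` and `t2`, and the use of canopy means as initial centers are assumptions made for the example.

```python
import math
import random

def canopy_centers(points, t1, t2):
    """Canopy pass: group points into loose canopies whose means
    serve as initial K-means centers. Requires t1 > t2."""
    remaining = list(points)
    canopies = []
    while remaining:
        # Pick a random point as the canopy center.
        center = remaining.pop(random.randrange(len(remaining)))
        members = [center]
        survivors = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:          # loosely close: join this canopy
                members.append(p)
            if d >= t2:         # not tightly bound: stays available
                survivors.append(p)
        remaining = survivors
        canopies.append(members)
    # The mean of each canopy becomes one initial K-means center.
    return [tuple(sum(c) / len(m) for c in zip(*m)) for m in canopies]

def kmeans(points, centers, iters=20):
    """Standard Lloyd iterations starting from the given centers."""
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda j: math.dist(p, centers[j]))
            buckets[i].append(p)
        centers = [tuple(sum(c) / len(b) for c in zip(*b)) if b else centers[i]
                   for i, b in enumerate(buckets)]
    return centers
```

Because Canopy fixes both the number of clusters and their starting positions, the subsequent K-means pass avoids the sensitivity to random initialization that the paper identifies in the traditional KMC.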
