Clustering Large Datasets with Apriori-based Algorithm and Concurrent Processing

Abstract—This paper presents an integrated data mining technique for finding appropriate initial centroids for k-means clustering. The process comprises data cleansing, preprocessing, and discovery of feature relations with the Apriori algorithm in order to select appropriate features. Our clustering process compares two schemes for selecting the initial centroids: static selection and random selection. The SSE (sum of squared errors) is computed in parallel for better computational performance. We propose the Pre-KMA model, which represents the processes for finding appropriate initial cluster centroids and selecting the most relevant features from large datasets. Evaluation by SSE, number of clustering iterations, and processing time confirms that the Pre-KMA model yields better k-means clustering results. The experimental results show that both the computed SSE and the processing time decrease.
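
As an illustration of the parallel SSE calculation mentioned in the abstract, the following minimal Python sketch splits the data points into chunks, computes a partial SSE for each chunk concurrently, and sums the partial results. Here the SSE is taken to be the sum, over all points, of the squared Euclidean distance to the nearest centroid. The function names (`squared_distance`, `chunk_sse`, `parallel_sse`), the `workers` parameter, and the use of a thread pool are assumptions made for this example and are not taken from the Pre-KMA implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chunk_sse(points, centroids):
    """Partial SSE: each point contributes its squared distance
    to the nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def parallel_sse(points, centroids, workers=4):
    """Split the points into roughly equal chunks, evaluate each chunk
    concurrently, and add up the partial SSE values."""
    size = max(1, -(-len(points) // workers))  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(chunk_sse, chunks, [centroids] * len(chunks))
        return sum(partials)


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of two points each.
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    centroids = [(0.0, 1.0), (10.0, 1.0)]
    print(parallel_sse(points, centroids))  # prints 4.0
```

Because the SSE is a plain sum over points, the partial sums can be computed independently and combined in any order, which is what makes the calculation easy to parallelize. In CPython, threads do not speed up pure-Python arithmetic, so a process pool (`concurrent.futures.ProcessPoolExecutor`) or a vectorized numerical library would likely be needed to observe an actual reduction in processing time; the thread pool is used here only to keep the sketch short and portable.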