A fast and scalable FPGA-based parallel processing architecture for K-means clustering for big data analysis

The exponential growth of complex, heterogeneous, dynamic, and unbounded data, generated by fields including health, genomics, physics, climatology, and social networks, poses significant challenges for data processing and for achieving the desired processing speed. Existing processor-based, software-only algorithms cannot analyze and process this enormous amount of data efficiently and effectively. Consequently, some form of hardware support is desirable to overcome the challenges of analyzing big data. Big data analytics involves many important data mining tasks, including clustering, which categorizes data into meaningful groups based on the similarity or dissimilarity among objects. In this research work, we introduce an efficient FPGA-based parallel processing architecture for K-means clustering, one of the most popular clustering algorithms. Experiments are performed on a benchmark dataset to evaluate the feasibility and efficiency of our hardware design. Our hardware architecture is generic, parameterized, and scalable to support larger and varying datasets as well as a varying number of clusters. Our hardware configuration with 32 processing elements (PEs) achieved a 368-fold speedup over its software counterpart.
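
To make the idea of splitting K-means across processing elements concrete, the following is a minimal software sketch and not the hardware design described above. It assumes the standard Lloyd iteration, an interleaved split of the data across a hypothetical number of PEs, and a reduction step that merges per-PE partial sums. The names `pe_partial_sums`, `kmeans`, and `num_pes` are illustrative assumptions, and the "PEs" here run sequentially rather than in parallel.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pe_partial_sums(chunk: Sequence[Point], centroids: Sequence[Point]):
    """Work of one hypothetical PE: assign each point in its chunk to the
    nearest centroid and accumulate per-cluster coordinate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           num_pes: int = 4, max_iters: int = 100) -> List[Point]:
    """Lloyd's K-means with the data split across num_pes simulated PEs."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    # Assumed interleaved partition: PE i receives points i, i + num_pes, ...
    chunks = [points[i::num_pes] for i in range(num_pes)]
    for _ in range(max_iters):
        # Each PE works independently on its own chunk (sequential here).
        partials = [pe_partial_sums(chunk, centroids) for chunk in chunks]
        # Reduction: merge the partial sums and counts from all PEs.
        total_sums = [[0.0] * dim for _ in range(k)]
        total_counts = [0] * k
        for sums, counts in partials:
            for j in range(k):
                total_counts[j] += counts[j]
                for d in range(dim):
                    total_sums[j][d] += sums[j][d]
        # Update step: an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(s / total_counts[j] for s in total_sums[j])
            if total_counts[j] else centroids[j]
            for j in range(k)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)], num_pes=2))
```

Because the assignment step touches each point independently, only the small per-cluster sums and counts need to be combined across PEs. That is the property that makes this kind of data-parallel split attractive. How the actual architecture partitions data and merges results is not stated in the abstract.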
