Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis

This paper introduces a kernel-based fuzzy clustering approach for non-linearly separable problems. It applies a Radial Basis Function (RBF) kernel, which maps the input data space non-linearly into a high-dimensional feature space. Discovering clusters in high-dimensional genomic data is extremely challenging for bioinformatics researchers performing genome analysis. To support bioinformatics investigations, specifically genomic clustering, we propose high-dimensional kernelized fuzzy clustering algorithms based on the Apache Spark framework for clustering Single Nucleotide Polymorphism (SNP) sequences. The paper proposes Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM), which internally uses a second proposed algorithm, Kernelized Scalable Literal Fuzzy c-Means (KSLFCM). Both approaches are fully adapted to the Apache Spark cluster framework through a localized sub-clustering method built on Resilient Distributed Datasets (RDDs). We also propose a preprocessing approach that generates numeric feature vectors from huge SNP sequences; it is made scalable by running on an Apache Spark cluster and is applied to real-world SNP datasets of two plant species, soybean and rice, taken from open internet repositories. Compared with similar works, the proposed scalable kernelized fuzzy clustering algorithms show significant improvement in time and space complexity, Silhouette index, and Davies-Bouldin index. Exhaustive experiments on various SNP datasets show the effectiveness of the proposed KSRSIO-FCM in comparison with the proposed KSLFCM and other scalable clustering algorithms, namely SRSIO-FCM and SLFCM.
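The abstract does not state the update equations, so the following is only a minimal single-machine sketch of one common kernelized fuzzy c-means formulation with an RBF kernel, in which cluster prototypes are kept in the input space and the feature-space distance is taken as proportional to 1 - K(x, v). It is not the paper's KSLFCM or KSRSIO-FCM and it does not use Spark. The names `rbf_kernel`, `kernel_fcm`, and the parameters `sigma`, `m`, `max_iter`, and `tol` are illustrative assumptions.

```python
import math


def rbf_kernel(x, v, sigma):
    """RBF kernel K(x, v) = exp(-||x - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def _memberships(points, centers, m, sigma, eps=1e-12):
    """Fuzzy memberships from kernel-induced distances 1 - K(x, v)."""
    rows = []
    for x in points:
        # eps guards against division by zero when x coincides with a center.
        inv = [max(1.0 - rbf_kernel(x, v, sigma), eps) ** (-1.0 / (m - 1.0))
               for v in centers]
        total = sum(inv)
        rows.append([w / total for w in inv])
    return rows


def kernel_fcm(points, init_centers, m=2.0, sigma=1.0, max_iter=100, tol=1e-6):
    """Kernelized fuzzy c-means sketch (assumed formulation, see lead-in).

    points       -- list of equal-length numeric sequences
    init_centers -- initial prototypes, supplied by the caller
    Returns (centers, memberships), memberships[i][j] for point i, cluster j.
    """
    centers = [list(v) for v in init_centers]
    dim = len(centers[0])
    for _ in range(max_iter):
        u = _memberships(points, centers, m, sigma)
        new_centers = []
        for j, v in enumerate(centers):
            # Each point is weighted by u_ij^m * K(x_i, v_j).
            weights = [(u[i][j] ** m) * rbf_kernel(x, v, sigma)
                       for i, x in enumerate(points)]
            denom = sum(weights)
            new_centers.append(
                [sum(w * x[d] for w, x in zip(weights, points)) / denom
                 for d in range(dim)]
            )
        # Largest Euclidean movement of any prototype in this iteration.
        shift = max(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift < tol:
            break
    # Recompute memberships so they match the returned centers.
    return centers, _memberships(points, centers, m, sigma)
```

In this sketch the caller supplies the initial prototypes, for example one sample drawn from each suspected group, because kernelized fuzzy c-means can stall in a poor local optimum when all prototypes start in the same region. The kernel width `sigma` must also be chosen relative to the scale of the feature vectors; no value is implied by the abstract.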
