Cluster ensemble based on Random Forests for genetic data

Background: Clustering plays a crucial role in several application domains, including bioinformatics, where it has been used extensively to detect interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have made it possible to obtain exceptionally large genetic datasets. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, so an efficient means of handling such data is desirable.

Results: Random Forests (RF) has emerged as an efficient algorithm capable of handling high-dimensional data. RF provides a proximity measure that can capture different levels of co-occurring relationships between variables. Although RF is widely regarded as a supervised learning method, it can be converted into an unsupervised one. An RF-derived proximity measure combined with a clustering technique may therefore be well suited to determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach based on RF for determining the underlying structure of genetic data. The approach uses a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on high-dimensional, real genetic data to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, a comparison of RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance.

Conclusions: This paper proposes RFcluE, a cluster ensemble approach based on RF clustering, to address the problem of population structure analysis, and demonstrates the effectiveness of the approach. The paper also illustrates that combining multiple RF clusterings in a cluster ensemble produces more robust and higher-quality results, because the ensemble is fed diverse views of high-dimensional genetic data obtained through bagging and the random subspace method, the two key features of the RF algorithm.
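To make the idea of combining multiple clusterings concrete, the following is a minimal sketch of one common consensus scheme, a co-association (evidence accumulation) matrix followed by thresholded linking. It is an illustration only: the abstract does not state which consensus function RFcluE uses, the function names `co_association` and `consensus_labels` are hypothetical, and the base partitions are hard-coded stand-ins for the output of separate RF clustering runs.

```python
def co_association(partitions):
    """Fraction of base clusterings in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_labels(co, threshold=0.5):
    """Link samples whose co-association exceeds the threshold and
    return the connected components as consensus cluster labels.
    (Assumed consensus rule, chosen for simplicity.)"""
    n = len(co)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                parent[find(i)] = find(j)

    # Relabel components 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical base partitions of six individuals from three clustering runs.
# The second run uses permuted label names; the third run is noisier.
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
print(consensus_labels(co_association(base_partitions)))
```

Because the co-association matrix depends only on whether two samples share a cluster, the arbitrary label names used by each run do not matter, and a single noisy run is outvoted by the others.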
