Model-based clustering with certainty estimation: implication for clade assignment of influenza viruses

Background: Clustering is a common technique used by molecular biologists to group homologous sequences and study their evolution. Open issues remain, including how to cluster molecular sequences accurately and, in particular, how to evaluate the certainty of clustering results.

Results: We presented a model-based clustering method for analyzing molecular sequences, described a subset bootstrap scheme for evaluating the certainty of the clusters, and showed an intuitive way to examine clusters using 3D visualization. We applied this approach to influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for highly pathogenic H5N1 avian influenza, which agrees with previous findings. The certainty that a given sequence was correctly assigned to a cluster was 1.0 in every case, and the certainty for a given cluster was also very high (0.92–1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated, and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The certainties for clusters, however, varied from 0.40 to 0.98; this variation is likely attributable to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method were all higher than those calculated with the standard bootstrap method, suggesting that our bootstrap scheme is applicable to the estimation of clustering certainty.

Conclusions: We formulated a clustering analysis approach that includes certainty estimation and 3D visualization of sequence data. We analyzed two sets of influenza A HA sequences, and the results indicate that our approach is applicable to clustering analysis of influenza viral sequences.
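The idea of a per-cluster certainty can be illustrated with a minimal sketch. The abstract does not specify the subset bootstrap scheme, so everything below is an assumption made for illustration: plain k-means stands in for the model-based (mixture) clustering step, each replicate reclusters a random subset drawn without replacement, and a cluster's certainty is taken as its mean best-match Jaccard similarity across replicates. The names `kmeans`, `cluster_certainty`, `n_rep`, and `frac` are hypothetical and are not taken from the paper.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            for p in points]


def kmeans(points, k, iters=50):
    """Plain k-means; a stand-in (assumption) for model-based clustering."""
    # Deterministic farthest-point initialisation.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers)


def jaccard(a, b):
    """Jaccard similarity of two index sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_certainty(points, k, n_rep=100, frac=0.8, seed=0):
    """Mean best-match Jaccard similarity of each reference cluster
    over n_rep reclusterings of random subsets (hypothetical scheme)."""
    rng = random.Random(seed)
    n = len(points)
    ref = kmeans(points, k)
    ref_sets = [{i for i in range(n) if ref[i] == j} for j in range(k)]
    totals = [0.0] * k
    for _ in range(n_rep):
        # Draw a subset of indices without replacement and recluster it.
        idx = sorted(rng.sample(range(n), int(frac * n)))
        sub_labels = kmeans([points[i] for i in idx], k)
        new_sets = [{i for i, lab in zip(idx, sub_labels) if lab == j}
                    for j in range(k)]
        kept = set(idx)
        for j in range(k):
            # Compare only on the points present in this subset.
            totals[j] += max(jaccard(ref_sets[j] & kept, s) for s in new_sets)
    return [t / n_rep for t in totals]
```

Calling `cluster_certainty(points, k)` on a list of coordinate tuples returns one value in [0, 1] per cluster. Values near 1 indicate a cluster that is recovered almost unchanged in every replicate, while heterogeneous clusters would be expected to score lower, in line with the spread of cluster certainties reported for the H7 data.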
