Virus classification in 60-dimensional protein space.

Because sequences diverge so widely among viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. Alignment-free methods for whole-genome comparison and phylogenetic tree reconstruction have therefore received increasing attention. Among them, the recently proposed "Natural Vector (NV) representation" has been used successfully to study the phylogeny of multi-segmented viruses in a 12-dimensional genome space derived from the nucleotide sequence structure. However, whether proteomes are preferable to genomes for determining viral phylogeny has not been thoroughly investigated. As the translated products of genes, proteins directly shape the viral structure and are vital for all metabolic pathways. In this study, we combine the NV representation of a protein sequence with the Hausdorff distance, which is suited to comparing point sets, to construct a 60-dimensional protein space. In this space we analyze the evolutionary relationships of 4021 viruses using the whole proteomes available in the current NCBI Reference Sequence Database (RefSeq). We also use the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. For Baltimore II viruses, for example, the prediction accuracy reaches 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we find that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates obtained with proteomes are still comparable to those obtained with genomes.
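As an illustration of the construction described above, the following Python sketch maps a protein sequence to a 60-dimensional vector and compares two proteomes, each treated as a set of such vectors, with the Hausdorff distance. The abstract does not state the exact formula, so the sketch assumes the commonly used natural vector definition: for each of the 20 amino acids, its count, its mean position, and its normalized second central moment of position. The function names `natural_vector` and `hausdorff`, the amino-acid ordering, and the toy sequences are hypothetical and chosen only for this example.

```python
import math

# Assumption: the 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def natural_vector(seq):
    """Return a 60-dimensional vector for a protein sequence.

    Assumed layout: 20 counts, then 20 mean positions, then 20 normalized
    second central moments. Positions are 1-based. Characters outside
    AMINO_ACIDS are not counted but still contribute to the length n.
    """
    n = len(seq)
    positions = {a: [] for a in AMINO_ACIDS}
    for i, ch in enumerate(seq.upper(), start=1):
        if ch in positions:
            positions[ch].append(i)

    counts, means, moments = [], [], []
    for a in AMINO_ACIDS:
        pos = positions[a]
        n_k = len(pos)
        if n_k == 0:
            # Absent amino acid: all three components are set to zero.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(pos) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def hausdorff(set_a, set_b):
    """Hausdorff distance between two non-empty sets of equal-length vectors."""
    def directed(xs, ys):
        # Largest distance from a point in xs to its nearest point in ys.
        return max(min(math.dist(x, y) for y in ys) for x in xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))


# Toy proteomes (made-up sequences, not real viral proteins).
proteome_1 = [natural_vector("MKVLA"), natural_vector("MTTAGK")]
proteome_2 = [natural_vector("MKVLG"), natural_vector("MSTAGK")]
distance = hausdorff(proteome_1, proteome_2)
```

A pairwise distance matrix computed this way over all proteomes could then be passed to a distance-based tree-building or nearest-neighbor classification step; the abstract does not specify which procedure is used, so that step is left out here.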
