Genomic Classification Using an Information-Based Similarity Index: Application to the SARS Coronavirus

Measures of genetic distance based on alignment methods are confined to sequences that are conserved and identifiable in all organisms under study. A number of alignment-free techniques, based on either statistical linguistics or information theory, have been developed to overcome this limitation. We present a novel alignment-free approach to measuring the similarity among genetic sequences that incorporates elements from both word rank order-frequency statistics and information theory. We first validate the method on human influenza A viral genomes and on the human mitochondrial DNA database. We then apply it to study the origin of the SARS coronavirus. We find that the majority of the SARS genome is most closely related to group 1 coronaviruses, with smaller regions matching sequences from groups 2 and 3. The information-based similarity index provides a new tool for measuring the similarity between datasets based on their information content and may have a wide range of applications in the large-scale analysis of genomic databases.
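The abstract does not give the formula, so the following is only a minimal sketch of one way a rank-order and entropy-weighted index of this kind could be computed, not the authors' implementation. The assumptions are: overlapping n-letter words, ranks by descending frequency, absent words sharing the last rank, and each word's rank difference weighted by the sum of its Shannon entropy terms in the two sequences. The function names, the default word length `n=3`, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter
from math import log


def word_counts(seq, n):
    """Count overlapping n-letter words in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ranks(counts, vocab):
    """Rank words by descending count (1 = most frequent).

    Ties are broken alphabetically; words absent from this sequence
    all share the rank just past the last observed word (assumption).
    """
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {w: i + 1 for i, w in enumerate(ordered)}
    last = len(ordered) + 1
    return {w: rank.get(w, last) for w in vocab}


def entropy_term(count, total):
    """Shannon entropy contribution -p*log(p) of a single word."""
    if count == 0 or total == 0:
        return 0.0
    p = count / total
    return -p * log(p)


def similarity_distance(seq_a, seq_b, n=3):
    """Entropy-weighted mean rank difference, scaled to [0, 1]."""
    ca, cb = word_counts(seq_a, n), word_counts(seq_b, n)
    vocab = sorted(set(ca) | set(cb))
    if len(vocab) < 2:
        return 0.0
    ta, tb = sum(ca.values()), sum(cb.values())
    ra, rb = ranks(ca, vocab), ranks(cb, vocab)
    weights = {w: entropy_term(ca[w], ta) + entropy_term(cb[w], tb)
               for w in vocab}
    z = sum(weights.values())
    if z == 0:
        # Degenerate case (one distinct word per sequence):
        # fall back to uniform weights.
        weights = {w: 1.0 for w in vocab}
        z = float(len(vocab))
    weighted = sum(abs(ra[w] - rb[w]) * weights[w] for w in vocab) / z
    return weighted / (len(vocab) - 1)
```

Under these assumptions, identical sequences give a distance of 0, the measure is symmetric in its two arguments, and sequences with no words in common give a strictly positive value. A matrix of such pairwise distances could then be passed to a standard tree-building method.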
