Barcodes for genomes and applications

Background

Each genome has a stable distribution of the combined frequency of each k-mer and its reverse complement, measured in sequence fragments as short as 1000 bp across the whole genome, for 1 < k < 6. The collection of these k-mer frequency distributions is unique to each genome and is termed the genome's barcode.

Results

We found that, for each genome, the majority of its short sequence fragments have highly similar barcodes, while sequence fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways of addressing two challenging problems: metagenome binning and the identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves on the state of the art in both binning accuracy and scope of applicability. Other attractive properties of genome barcodes include (a) barcodes have different and identifiable characteristics for different classes of genomes, such as prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcode similarities are generally proportional to the genomes' phylogenetic closeness.

Conclusion

These and other properties of genome barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.
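The barcode definition given in the Background can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default k = 4, the use of non-overlapping 1000 bp fragments, and the choice to skip k-mers containing characters other than A, C, G, T are all assumptions made for illustration. Each k-mer is merged with its reverse complement by mapping both to the lexicographically smaller of the two, so one count covers the pair.

```python
from collections import Counter
from itertools import product

# Translation table for complementing a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by a single key."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(k):
    """Sorted list of all distinct k-mer / reverse-complement classes."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def fragment_frequencies(fragment, k):
    """Combined k-mer frequencies for one fragment, in canonical_kmers(k) order."""
    counts = Counter()
    for i in range(len(fragment) - k + 1):
        kmer = fragment[i:i + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are ignored.
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in canonical_kmers(k)]


def barcode(sequence, k=4, fragment_len=1000):
    """One row of frequencies per fragment (non-overlapping fragments assumed)."""
    sequence = sequence.upper()
    rows = []
    for start in range(0, len(sequence) - fragment_len + 1, fragment_len):
        rows.append(fragment_frequencies(sequence[start:start + fragment_len], k))
    return rows
```

Under these assumptions, `barcode(genome_sequence, k=4)` returns a matrix with one row per fragment and one column per combined k-mer class, so fragments whose rows differ markedly from the rest are the kind of outliers the Results describe.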
