An analysis of the Sargasso Sea resource and the consequences for database composition

Background: The environmental sequencing of the Sargasso Sea has introduced a huge new resource of genomic information. Unlike the protein sequences held in the current searchable databases, the Sargasso Sea sequences originate from a single marine environment and have been sequenced from species that are not easily obtainable by laboratory cultivation. The resource also contains very many fragments of whole protein sequences, a side effect of the shotgun sequencing method.

These sequences form a significant addendum to the current searchable databases but also present us with some intrinsic difficulties. While it is important to know whether it is possible to assign function to these sequences with the current methods and whether they will increase our capacity to explore sequence space, it is also interesting to know how current bioinformatics techniques will deal with the new sequences in the resource.

Results: The Sargasso Sea sequences seem to introduce a bias that decreases the potential of current methods to propose structure and function for new proteins. In particular the high proportion of sequence fragments in the resource seems to result in poor quality multiple alignments.

Conclusion: These observations suggest that the new sequences should be used with care, especially if the information is to be used in large scale analyses. On a positive note, the results may just spark improvements in computational and experimental methods to take into account the fragments generated by environmental sequencing techniques.
