Characterization of protein families, sequence patterns, and functional annotations in large data sets

Background: Taxonomically unbiased analyses of protein relationships require complete proteomes rather than databases biased towards well-characterized protein families. However, no comprehensive resource of completed proteomes is currently available. Instead, the proteomes need to be downloaded manually from different servers, all using different filename conventions and FASTA header formats. Results: We have developed a semi-automatic algorithm that retrieves complete proteomes from multiple FTP servers and maps the species-specific sequence entries to the NCBI taxonomy. The compiled data is provided in a sequence database named genomeLKPG. Conclusions: The usefulness of genomeLKPG has been demonstrated in several published taxonomic studies.
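The mapping step described in the Results can be illustrated with a minimal sketch. It is not the authors' algorithm: the header patterns, the unified header layout, and the function names are assumptions made for illustration, and a plain dictionary stands in for a lookup against the NCBI taxonomy. FTP retrieval is left out so that the example stays self-contained.

```python
import re
from typing import Dict, Iterator, Optional, Tuple

# Hypothetical header patterns, one per source server; real headers may differ.
HEADER_PATTERNS = {
    "ncbi": re.compile(r"^gi\|\d+\|ref\|(?P<acc>[^|]+)\|"),
    "ensembl": re.compile(r"^(?P<acc>ENS[A-Z]*P\d+)"),
    "generic": re.compile(r"^(?P<acc>\S+)"),
}


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header: Optional[str] = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_accession(header: str, source: str) -> Optional[str]:
    """Pull the accession out of a source-specific header, or return None."""
    pattern = HEADER_PATTERNS.get(source, HEADER_PATTERNS["generic"])
    match = pattern.match(header)
    return match.group("acc") if match else None


def unify_records(
    fasta_text: str, source: str, species: str, taxonomy: Dict[str, int]
) -> Iterator[Tuple[str, str]]:
    """Rewrite every header as 'accession|taxid=N|species' (assumed layout)."""
    taxid = taxonomy[species]  # raises KeyError for an unmapped species
    for header, sequence in parse_fasta(fasta_text):
        accession = extract_accession(header, source)
        if accession is None:
            continue  # skip entries whose header cannot be parsed
        yield f"{accession}|taxid={taxid}|{species}", sequence


# Stand-in for an NCBI taxonomy lookup (9606 is the taxid of Homo sapiens).
TAXONOMY = {"Homo sapiens": 9606}

example = ">ENSP00000354587 pep:known\nMPL\nTTY\n"
for new_header, seq in unify_records(example, "ensembl", "Homo sapiens", TAXONOMY):
    print(">" + new_header)
    print(seq)
```

Keeping one pattern per source means a new server with its own header convention only needs a new entry in the pattern table, while the taxonomy mapping is applied once per proteome file instead of once per sequence.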
