Paper information - Characterization of protein families, sequence patterns, and functional annotations in large data sets

Characterization of protein families, sequence patterns, and functional annotations in large data sets

Background: Taxonomically unbiased analyses of protein relationships require complete proteomes rather than databases biased towards well-characterized protein families. However, no comprehensive resource of completed proteomes is currently available. Instead, the proteomes must be downloaded manually from different servers, all using different filename conventions and FASTA header formats. Results: We have developed a semi-automatic algorithm that retrieves complete proteomes from multiple FTP servers and maps the species-specific sequence entries to the NCBI taxonomy. The compiled data are provided in a sequence database named genomeLKPG. Conclusions: The usefulness of genomeLKPG has been demonstrated in several published taxonomic studies.

Bengt Persson | Inge Jonassen
