3PFDB - A database of Best Representative PSSM Profiles (BRPs) of Protein Families generated using a novel data mining approach

Background

Protein families can be related to each other at broad levels that group them into superfamilies. These relationships are harder to detect at the sequence level because of high evolutionary divergence. Sequence searches are strongly directed and influenced by the best representatives of families, which serve as starting points. PSSMs are useful approximations and mathematical representations of protein alignments, with a wide array of applications in bioinformatics, such as remote homology detection, protein family analysis, detection of new members and evolutionary modelling. Computationally intensive searches have been performed using the neural-network-based sensitive sequence search method called FASSM to identify the Best Representative PSSMs for families reported in Pfam database version 22.

Results

We designed a novel data mining approach that assesses individual sequences from a protein family to identify a single Best Representative PSSM profile (BRP) per protein family. Using this approach, we developed a database of protein family-specific best representative PSSM profiles called 3PFDB. PSSM profiles in 3PFDB are curated using the performance of each individual sequence as a reference, in a rigorous scoring and coverage analysis carried out with FASSM. We assessed the suitability of 1,085,588 sequences derived from seed or full alignments reported in the Pfam database (version 22). Coverage analysis using FASSM serves as the filtering step to identify the best representative sequence, starting from full-length or domain sequences, from which the final profile for a given family is generated. 3PFDB is a collection of best representative PSSM profiles of 8,524 protein families from the Pfam database.

Conclusion

The availability of an approach to identify BRPs, and of a curated database of best representative PSI-BLAST-derived PSSMs for 91.4% of current Pfam families, will be a useful resource for the community to perform detailed and specific analyses using family-specific, best-representative PSSM profiles. 3PFDB can be accessed at http://caps.ncbs.res.in/3pfdb
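
The coverage-based filtering step described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how coverage is computed, so the sketch assumes coverage is the fraction of known family members recovered when a candidate sequence's profile is used as the query. The function name `pick_best_representative` and the `hits` mapping are hypothetical; in practice the hit sets would come from a search tool such as FASSM or PSI-BLAST, which is not invoked here.

```python
from typing import Dict, Set, Tuple


def pick_best_representative(
    hits: Dict[str, Set[str]], family: Set[str]
) -> Tuple[str, float]:
    """Return the candidate with the highest coverage of the family.

    hits   -- hypothetical mapping: candidate sequence id -> ids found
              when that candidate's profile is used as the search query
    family -- ids of all known members of the family
    """
    if not hits or not family:
        raise ValueError("need at least one candidate and one family member")

    def coverage(candidate: str) -> float:
        # Assumed definition: fraction of family members recovered.
        # Hits outside the family are ignored, not penalised.
        return len(hits[candidate] & family) / len(family)

    # Iterating in sorted order makes ties resolve to the smallest id.
    best = max(sorted(hits), key=coverage)
    return best, coverage(best)


if __name__ == "__main__":
    family = {"seqA", "seqB", "seqC", "seqD"}
    hits = {
        "seqA": {"seqA", "seqB"},
        "seqB": {"seqA", "seqB", "seqC", "other"},
        "seqC": {"seqC"},
    }
    print(pick_best_representative(hits, family))
```

A real pipeline would probably also weigh alignment scores and penalise false positives; this sketch ranks candidates on coverage alone.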
