The tetratricopeptide repeats (TPR)-like superfamily of proteins in Leishmania spp., as revealed by multi-relational data mining

Protein sequence analysis tasks are multi-relational problems and are therefore well suited to multi-relational data mining (MRDM). Proteins containing tetratricopeptide (TPR), pentatricopeptide (PPR) and half-a-TPR (HAT) repeats comprise the TPR-like superfamily, to which we applied MRDM methods (relational association rule discovery and probabilistic relational models) together with hidden Markov models (HMMs) and the Viterbi algorithm (VA) in genome databases of the pathogenic protozoa Leishmania. This integrated MRDM/HMM/VA approach seeks to capture as much model information as possible in the pattern-matching heuristic without resorting to the more standard motif discovery methods (Pfam, SMART, SUPERFAMILY). Like a more recently reported tool (TPRpred), it incorporates optimized profiles, score offsets and score distributions for computing probabilities, in order to take into account the tendency of repeats to occur in tandem and to be widely distributed along the sequences. Here we compare these currently available resources with our approach (MRDM/HMM/VA) and show that the latter performs best in TPR-like superfamily assignment. It might also be applied to other sequence analysis problems, where it could contribute to tight-fit motif discovery and to a higher probability that a given target sequence is indeed a target motif-containing protein.
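To make the HMM/VA component concrete, the following is a minimal sketch of Viterbi decoding on a toy two-state HMM. It is not the authors' implementation: the state names (`bg`, `rep`), the two-letter alphabet (`h` for hydrophobic, `p` for polar) and all probabilities are hypothetical values chosen only for illustration, whereas a real profile HMM would use the 20-letter amino acid alphabet and position-specific match, insert and delete states.

```python
import math

# Hypothetical toy model: "bg" = background, "rep" = repeat-like region.
# All probabilities below are illustrative assumptions, not fitted values.
STATES = ("bg", "rep")
START = {"bg": 0.5, "rep": 0.5}
TRANS = {"bg": {"bg": 0.9, "rep": 0.1},
         "rep": {"bg": 0.1, "rep": 0.9}}
EMIT = {"bg": {"h": 0.2, "p": 0.8},
        "rep": {"h": 0.8, "p": 0.2}}


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log space."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


LOG_START, LOG_TRANS, LOG_EMIT = to_log(START), to_log(TRANS), to_log(EMIT)


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for a non-empty observation sequence."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    # v[s]: best log-probability of any state path ending in s at this position
    v = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev_v, v, ptr = v, {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position
            best_prev = max(states, key=lambda p: prev_v[p] + log_trans[p][s])
            v[s] = prev_v[best_prev] + log_trans[best_prev][s] + log_emit[s][symbol]
            ptr[s] = best_prev
        backpointers.append(ptr)
    # Trace back from the best final state to recover the full path
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return v[last], path


def decode(seq):
    """Decode a string over the toy alphabet {'h', 'p'} with the toy model."""
    return viterbi(seq, STATES, LOG_START, LOG_TRANS, LOG_EMIT)


if __name__ == "__main__":
    log_prob, path = decode("pppphhhhpppp")
    print(log_prob, path)
```

With these toy parameters, the run of `h` symbols in the middle of the input is labelled `rep` and the flanking `p` symbols are labelled `bg`, because the gain in emission probability outweighs the cost of two state switches. Working in log space avoids numerical underflow on long sequences.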
