Efficient Discovery of Structural Motifs from Protein Sequences with Combination of Flexible Intra- and Inter-block Gap Constraints

Discovering protein structural signatures directly from their primary information is a challenging task, because the residues associated with a functional motif are not necessarily clustered in one region of the sequence. This work proposes an algorithm that aims to discover conserved sequential blocks interleaved by large irregular gaps from a set of unaligned biological sequences. Different from the previous works that employ only one type of constraint on gap flexibility, we propose using combination of intra- and inter-block gap constraints to discover longer patterns with larger irregular gaps. The smaller flexible intra-block gap constraint is used to relax the restriction in local motif blocks but still keep them compact, and the larger flexible inter-block gap constraint is proposed to allow longer irregular gaps between compact motif blocks. Using two types of gap constraints for different purposes improves the efficiency of mining process while keeping high accuracy of mining results. The efficiency of the algorithm also helps to identify functional motifs that are conserved in only a small subset of the input sequences.

[1]  Yuan Shi,et al.  Contributions of cysteine residues in Zn2 to zinc fingers and thiol-disulfide oxidoreductase activities of chaperone DnaJ. , 2005, Biochemistry.

[2]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[3]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[4]  Jian Pei,et al.  Constrained frequent pattern mining: a pattern-growth view , 2002, SKDD.

[5]  Benno Schwikowski,et al.  An Exact Algorithm to Identify Motifs in Orthologous Sequences from Multiple Species , 2000, ISMB.

[6]  William R. Taylor,et al.  Protein bioinformatics - an algorithmic approach to sequence and structure analysis , 2004 .

[7]  David R. Gilbert,et al.  Approaches to the Automatic Discovery of Patterns in Biosequences , 1998, J. Comput. Biol..

[8]  D. Higgins,et al.  Finding flexible patterns in unaligned protein sequences , 1995, Protein science : a publication of the Protein Society.

[9]  Salvatore Orlando,et al.  A new algorithm for gap constrained sequence mining , 2004, SAC '04.

[10]  A. F. Neuwald,et al.  Detecting patterns in protein sequences. , 1994, Journal of molecular biology.

[11]  P E Wright,et al.  Solution structure of the cysteine-rich domain of the Escherichia coli chaperone protein DnaJ. , 2000, Journal of molecular biology.

[12]  D. Shasha,et al.  Discovering active motifs in sets of related protein sequences and using them for classification. , 1994, Nucleic acids research.

[13]  M Kanehisa,et al.  Construction of a dictionary of sequence motifs that characterize groups of related proteins , 1992, Protein engineering.

[14]  Aris Floratos,et al.  Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm [published erratum appears in Bioinformatics 1998;14(2): 229] , 1998, Bioinform..

[15]  M J Sternberg,et al.  Identification of sequence motifs from a set of proteins with related function. , 1994, Protein engineering.

[16]  Jianyong Wang,et al.  Mining sequential patterns by pattern-growth: the PrefixSpan approach , 2004, IEEE Transactions on Knowledge and Data Engineering.

[17]  Dimitrios I. Fotiadis,et al.  Greedy mixture learning for multiple motif discovery in biological sequences , 2003, Bioinform..

[18]  B. Edwards,et al.  Insights into the structure, solvation, and mechanism of ArsC arsenate reductase, a novel arsenic detoxification enzyme. , 2001, Structure.

[19]  Giri Narasimhan,et al.  Mining Protein Sequences for Motifs , 2002, J. Comput. Biol..

[20]  Inge Jonassen,et al.  Efficient discovery of conserved patterns using a pattern graph , 1997, Comput. Appl. Biosci..

[21]  Amos Bairoch,et al.  The PROSITE database, its status in 2002 , 2002, Nucleic Acids Res..

[22]  Lin Lu,et al.  eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity , 2005, Nucleic Acids Res..