Identifying protein binding functionality of protein family sequences by Aligned Pattern clusters

A basic task in protein analysis is to discover a set of sequence patterns that reflects the function of a protein family. Such a set contains significant residue associations that need not match exactly. Existing combinatorial methods are computationally expensive, and probabilistic methods require a richer representation of the amino acid associations. To address this task, we create a synthesized pattern representation called an Aligned Pattern (AP) Cluster, which captures both the residue associations in the binding segment and the site variations in the aligned residues. In this paper, our algorithm identifies the binding segments for two protein families: the Cytochrome Complex and the Ubiquitin protein families. In each experiment, the AP Clusters obtained correspond to protein binding segments, including a few beyond those identified by the protein databases PROSITE and Pfam. Furthermore, the columns of aligned sites that take only a single value in the AP Clusters also correspond to binding residues. The additional information retained by the AP Clusters can reveal the amino acid residues of interest, thus averting time-consuming simulations and experimentation.
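The idea of reading conserved sites off an AP Cluster can be illustrated with a minimal sketch. It assumes the cluster is already available as a list of equal-width aligned pattern strings with `-` marking a gap; the function names `column_profiles` and `conserved_columns` and the toy patterns are hypothetical and are not taken from the paper, and the sketch does not reproduce the paper's clustering algorithm.

```python
from collections import Counter


def column_profiles(aligned_patterns):
    """Count the residues observed in each aligned column."""
    width = len(aligned_patterns[0])
    if any(len(p) != width for p in aligned_patterns):
        raise ValueError("all aligned patterns must have the same width")
    return [Counter(p[i] for p in aligned_patterns) for i in range(width)]


def conserved_columns(aligned_patterns, gap="-"):
    """Return (index, residue) for columns that hold a single non-gap value."""
    conserved = []
    for i, counts in enumerate(column_profiles(aligned_patterns)):
        if len(counts) == 1 and gap not in counts:
            conserved.append((i, next(iter(counts))))
    return conserved


# Hypothetical toy cluster for illustration only (not data from the paper).
cluster = ["CAACH", "CSGCH", "CAQCH", "CKACH"]

if __name__ == "__main__":
    for index, residue in conserved_columns(cluster):
        print(f"column {index}: conserved residue {residue}")
    # Columns that vary keep their full residue counts as site variations.
    print(column_profiles(cluster)[1])
```

In this toy cluster the first, fourth, and fifth columns hold a single residue, while the second and third retain their per-residue counts, which is the kind of site variation the abstract says the AP Cluster preserves.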
