Aligning and Clustering Patterns to Reveal the Protein Functionality of Sequences

Discovering sequence patterns with variations reveals significant functions of a protein family. Existing combinatorial methods for discovering such patterns are computationally expensive, and probabilistic methods require a more elaborate probabilistic representation of the amino acid associations. To overcome these shortcomings, this paper presents a new, computationally efficient method that represents patterns with variations compactly as an Aligned Pattern Cluster (AP Cluster). To reduce runtime, our method discovers a shortened list of non-redundant, statistically significant sequence associations based on our previous work. To represent protein functional regions, the pattern alignment and clustering step presented in this paper captures the conservations and variations of the aligned patterns. We further refine our solution to cover more sequences by extending the AP Clusters, which contain only statistically significant patterns, to Weak and Conserved AP Clusters. When applied to the cytochrome c, ubiquitin, and triosephosphate isomerase protein families, our algorithm identifies the binding segments as well as the binding residues. Compared with other methods, ours discovers all binding sites in the AP Clusters with superior entropy and coverage. Identifying patterns with variations helps biologists avoid time-consuming simulations and experiments. (Software available upon request.)
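
To make the notions of conservation and variation within a set of aligned patterns concrete, the following is a minimal sketch and not the paper's algorithm. It assumes the patterns are already aligned to equal length (with `-` as a gap symbol) and computes the Shannon entropy of each column; the function name `column_entropies` and the toy patterns are hypothetical and chosen only for illustration. How the paper itself aligns patterns, forms clusters, or scores entropy is not specified in this abstract.

```python
from collections import Counter
from math import log2


def column_entropies(aligned_patterns):
    """Return the Shannon entropy (in bits) of every column.

    Assumption: all patterns are pre-aligned strings of equal length.
    A gap symbol such as '-' is simply counted as one more symbol.
    """
    if not aligned_patterns:
        return []
    width = len(aligned_patterns[0])
    if any(len(p) != width for p in aligned_patterns):
        raise ValueError("all aligned patterns must have the same length")

    entropies = []
    for column in zip(*aligned_patterns):
        counts = Counter(column)
        total = len(column)
        # H = sum over symbols of p * log2(1 / p)
        h = sum((n / total) * log2(total / n) for n in counts.values())
        entropies.append(h)
    return entropies


if __name__ == "__main__":
    # Toy patterns (hypothetical data, not taken from the paper).
    toy_cluster = ["CAACH", "CSACH", "CAGCH"]
    for position, h in enumerate(column_entropies(toy_cluster)):
        label = "conserved" if h == 0 else "variable"
        print(position, round(h, 3), label)
```

Columns with zero entropy contain a single residue across all patterns (conserved), while columns with higher entropy show variation. A lower average entropy over the columns indicates a more tightly conserved group of patterns.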
