An efficient statistical methodology for peptide 3D structure clustering

The analysis of protein and peptide conformations is crucial for gaining insight into their biological functions, and it has therefore been an active research topic over the past decades. However, analyzing the conformations of small, highly flexible peptides remains a challenge because of their instability and the large number of different shapes they adopt. In this paper, an efficient methodology is proposed to analyze the 3D structures of highly flexible elastin-derived peptides and to identify their principal conformations using a clustering algorithm. The methodology is based on a special representation of peptide structures that is unaffected by translations or rotations of the peptide, which avoids the need for a complex superposition method. In addition, the proposed approach uses, for the first time, Kernel PCA to remove outlier structures, that is, structures that occur rarely and do not resemble any other peptide structure. Outlier removal is very important in this context because, owing to the instability of these peptides, a small fraction of very different, seldom-occurring conformations can heavily affect the subsequent clustering results. The final step of the proposed approach is hierarchical clustering, used as an unsupervised classification method to group similar structures. Experimental results, obtained on an existing database, show the relevance and the efficiency of the proposed method.
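The three stages described above can be illustrated with a minimal sketch. It is not the implementation used in this work, and several choices are assumptions made only for illustration: the invariant representation is taken to be the vector of pairwise atom distances, the outlier score is the Kernel PCA reconstruction error with an RBF kernel, and the names `distance_features`, `kpca_reconstruction_error` and `cluster_structures`, the toy templates, and the values of `gamma` and `n_components` are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def distance_features(coords):
    """Map an (n_atoms, 3) array to its vector of pairwise atom distances.

    Assumed stand-in for the invariant representation: internal distances
    do not change when the structure is translated or rotated.
    """
    return pdist(coords)


def kpca_reconstruction_error(X, gamma, n_components):
    """Squared feature-space distance of each row of X to the subspace
    spanned by the leading kernel principal components (RBF kernel)."""
    K = np.exp(-gamma * squareform(pdist(X, "sqeuclidean")))
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones  # center in feature space
    eigval, eigvec = np.linalg.eigh(Kc)  # eigenvalues in ascending order
    eigval = eigval[::-1][:n_components]
    eigvec = eigvec[:, ::-1][:, :n_components]
    # Squared norm of each centered point minus its squared projections.
    return np.diag(Kc) - (eigvec ** 2 * eigval).sum(axis=1)


def cluster_structures(X, n_clusters):
    """Agglomerative (average-linkage) clustering cut into n_clusters groups."""
    Z = linkage(X, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Toy data (hypothetical): two conformational families and one rare shape.
rng = np.random.default_rng(0)
i = np.arange(5)
extended = np.column_stack([3.8 * i, np.zeros(5), np.zeros(5)])
zigzag = np.column_stack([2.0 * i, 3.23 * (i % 2), np.zeros(5)])
angle = 2 * np.pi * i / 5
radius = 3.8 / (2 * np.sin(np.pi / 5))
ring = radius * np.column_stack([np.cos(angle), np.sin(angle), np.zeros(5)])


def noisy_copies(template, n):
    """Return n copies of a template with small Gaussian coordinate noise."""
    return [template + rng.normal(scale=0.05, size=template.shape)
            for _ in range(n)]


structures = noisy_copies(extended, 20) + noisy_copies(zigzag, 20) + [ring]
X = np.array([distance_features(s) for s in structures])

# Stage 2: score every structure and drop the single highest-error one.
err = kpca_reconstruction_error(X, gamma=0.2, n_components=1)
keep = np.sort(np.argsort(err)[:-1])

# Stage 3: hierarchical clustering of the remaining structures.
labels = cluster_structures(X[keep], n_clusters=2)
print(labels)
```

In this sketch the kernel width and the number of retained components matter a great deal: if too many components are kept, the principal subspace can absorb the rare conformation and its reconstruction error drops. Note also that pairwise distances cannot distinguish a structure from its mirror image, which is a limitation of this assumed representation.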
