Hybrid Spectral/Subspace Clustering of Molecular Dynamics Simulations

Data clustering approaches are widely used in many domains, including the analysis of molecular dynamics (MD) simulations. Modern clustering applications for MD simulation data must be capable of assessing both natively folded and disordered proteins. We compare the performance of spectral clustering with that of a more recent subspace clustering approach and of a newly proposed 'hybrid' clustering algorithm, which seeks to combine the useful characteristics of both methods, on MD data from both protein classes. Results are analysed in terms of accuracy, stability, data density, and other properties. We conclude by identifying which combinations of algorithm, improvement, and data density provide results that are more accurate or more stable. We find that subspace clustering produces better results than standard spectral clustering, especially for disordered proteins, regardless of input data density or choice of affinity scaling. Additionally, our hybrid approach improves on the subspace results in most cases, and entropic affinity scaling improves the performance of both spectral clustering and our hybrid approach.
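For readers unfamiliar with the spectral clustering baseline named above, the following is a minimal sketch and not the authors' implementation. It makes several assumptions: each MD frame is reduced to a short feature vector (the toy array `frames` is hypothetical), affinities use a Gaussian kernel with a single global bandwidth `sigma` rather than the entropic affinity scaling discussed in the abstract, and only two clusters are sought, so the sign of the second eigenvector of the normalized Laplacian stands in for the k-means step used in general spectral clustering. The function names are illustrative.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Pairwise Gaussian affinities with one global bandwidth (an assumption)."""
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-affinity
    return W

def spectral_bipartition(W):
    """Split points into two clusters using the normalized graph Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues returned in ascending order
    # Second-smallest eigenvector, rescaled to the random-walk form
    fiedler = vecs[:, 1] * d_inv_sqrt
    return (fiedler > 0).astype(int)

# Hypothetical toy "frames": two tight groups of 2-D feature vectors.
frames = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [3.0, 0.0], [3.1, 0.0], [3.0, 0.1], [3.1, 0.1],
])
labels = spectral_bipartition(gaussian_affinity(frames, sigma=1.0))
```

In this sketch the first four frames receive one label and the last four the other. Which group is labelled 0 or 1 depends on the arbitrary sign of the eigenvector. Changing how the affinity matrix `W` is built, for example through a different scaling of the kernel, is the kind of choice the abstract refers to as affinity scaling.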
