Recognizing Local and Global Structural Motifs at the Atomic Scale

Most of the current understanding of structure-property relations at the molecular and supramolecular scales can be formulated in terms of the stability of, and the interactions between, a limited number of recurring structural motifs (e.g., H-bonds, coordination polyhedra, and protein secondary structure). Here we demonstrate an algorithm that recognizes such patterns automatically by identifying local maxima in the probability distributions observed in atomistic computer simulations, and that is robust to the dimensionality and the sparsity of the reference atomistic data. We first discuss its main features, demonstrating some of them on artificial data sets, and then show how it can be applied to identify coordination environments in Lennard-Jones clusters and to recognize secondary-structure patterns in the simulation of an oligopeptide. To assess the applicability of this algorithm to motifs that involve several interdependent degrees of freedom, we also employ it to identify groups of conformers of the cluster and the polypeptide, each considered in its entirety. The motifs identified by analyzing atomistic simulations can be used to interpret and rationalize the stability and behavior of the system at hand, and also, in association with biased molecular dynamics schemes, as a tool to accelerate sampling.
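
To make the idea of "identifying local maxima in the probability distribution" concrete, the following is a minimal sketch, not the algorithm of this work. It assumes a generic stand-in technique: a Gaussian kernel density estimate with a fixed isotropic bandwidth, combined with mean-shift mode seeking. The function names (`kde`, `climb_to_mode`, `assign_motifs`) and the parameters `bandwidth` and `merge_radius` are hypothetical and chosen only for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kde(point, data, bandwidth):
    """Isotropic Gaussian kernel density estimate evaluated at `point`."""
    dim = len(point)
    norm = len(data) * (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = sum(math.exp(-0.5 * sq_dist(point, s) / bandwidth ** 2) for s in data)
    return total / norm


def climb_to_mode(start, data, bandwidth, max_iter=500, tol=1e-8):
    """Iterate mean-shift updates from `start` until the step is below `tol`.

    Each update moves to the kernel-weighted mean of the samples, which
    never decreases the Gaussian KDE. `start` is assumed to lie close to
    the data (e.g. to be one of the samples) so the weights do not underflow.
    """
    x = tuple(start)
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * sq_dist(x, s) / bandwidth ** 2) for s in data]
        total = sum(weights)
        new_x = tuple(
            sum(w * s[k] for w, s in zip(weights, data)) / total
            for k in range(len(x))
        )
        if sq_dist(new_x, x) < tol ** 2:
            return new_x
        x = new_x
    return x


def assign_motifs(data, bandwidth, merge_radius=None):
    """Label each sample by the density maximum it climbs to.

    Converged points closer than `merge_radius` (default: the bandwidth,
    an arbitrary choice for this sketch) are treated as the same mode.
    Returns (modes, labels).
    """
    if merge_radius is None:
        merge_radius = bandwidth
    modes, labels = [], []
    for sample in data:
        peak = climb_to_mode(sample, data, bandwidth)
        for index, mode in enumerate(modes):
            if sq_dist(peak, mode) < merge_radius ** 2:
                labels.append(index)
                break
        else:
            # No known mode is nearby: register a new one.
            modes.append(peak)
            labels.append(len(modes) - 1)
    return modes, labels
```

In this sketch each returned mode plays the role of a candidate motif, and `labels` assigns every sampled configuration to one of them. The fixed bandwidth is the main simplification: it has to be chosen by hand, and a single value is unlikely to suit sparse, high-dimensional data.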
