Statistical Inference for Big Data Problems in Molecular Biophysics

We highlight the role of statistical inference techniques in extracting biological insights from long time-scale molecular simulation data. Technological and algorithmic improvements in computation have brought molecular simulations to the forefront of techniques for investigating the basis of living systems. While these longer simulations, which are increasingly complex and now reach petabyte scales, promise a detailed view into microscopic behavior, teasing out the important information has become a challenge in its own right. Mining this data for important patterns is critical to automating therapeutic intervention discovery, improving protein design, and fundamentally understanding the mechanistic basis of cellular homeostasis.

1 Molecular Biophysics

Over the last 30 years, biophysicists have taken advantage of advances in computing power to run increasingly detailed simulations of biomolecules in order to investigate the mechanistic basis of their function. The structure, dynamics and function of biological macromolecules such as proteins, deoxyribonucleic and ribonucleic acids (DNA/RNA), carbohydrates and lipids control cellular function, and thus life. Proteins, the workhorses of the cell, are long polymers of amino-acid residues that fold into three-dimensional structures to perform their function. The biological function controlled by the dynamical interactions between various biomolecules can occur at multiple time scales, from femtoseconds up to microseconds, milliseconds, seconds and beyond, spanning more than 15 orders of magnitude. Molecular dynamics (MD) simulations provide insights into the dependence of biological function on interactions at multiple length and time scales. In this paper, we focus on fully atomistic simulations of proteins and other biomolecules in solution, as they best represent the cellular environment. MD simulations are governed by a potential energy function that includes both bonded and non-bonded interaction terms.
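For concreteness, one commonly used functional form of such a potential is written out here. It is an illustrative assumption rather than the specific force field used in this work: the symbols (force constants k_b, k_theta, k_phi, equilibrium values b_0 and theta_0, Lennard-Jones parameters epsilon_ij and sigma_ij, partial charges q_i) are generic names and are not taken from the text.

```latex
U(\mathbf{r}) =
  \underbrace{\sum_{\text{bonds}} k_b \,(b - b_0)^2
  + \sum_{\text{angles}} k_\theta \,(\theta - \theta_0)^2
  + \sum_{\text{dihedrals}} k_\phi \left[ 1 + \cos(n\phi - \delta) \right]}_{\text{bonded}}
  + \underbrace{\sum_{i<j} \left\{ 4\varepsilon_{ij}
      \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12}
           - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]
      + \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}} \right\}}_{\text{non-bonded}}
```

The bonded sums run over covalently connected atoms, while the non-bonded sum (van der Waals plus electrostatics) runs over atom pairs and typically dominates the computational cost.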
The energy function and its parameters define a force field, and its negative gradient gives the force applied to every atom in the molecule. At each time step, Newton's equations of motion are integrated numerically to generate a trajectory. A time step on the order of a femtosecond (10^-15 s) is necessary to capture the fastest vibrations of interest, whereas biologically interesting events typically occur at microsecond (10^-6 s) and longer time scales. With improvements in sampling techniques and available hardware resources, MD simulations have successfully crossed the millisecond (10^-3 s) time-scale barrier [1] and have provided novel insights into the functioning of biomolecular systems.
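The integration loop described above can be sketched on a toy system. The following is a minimal illustration under stated assumptions, not the simulation engine used in this work: three atoms on a line in arbitrary reduced units, one harmonic bond (bonded term), one Lennard-Jones pair (non-bonded term), and the velocity Verlet scheme as the integrator. All parameter values and function names are hypothetical.

```python
import numpy as np

# Toy parameters in arbitrary reduced units (illustrative assumptions only).
K_BOND, R0 = 100.0, 1.0   # harmonic bond stiffness and rest length
EPS, SIGMA = 0.2, 1.0     # Lennard-Jones well depth and length scale


def potential_energy(x):
    """Bonded term (atoms 0-1) plus non-bonded term (atoms 1-2) in 1D."""
    r01 = abs(x[1] - x[0])
    r12 = abs(x[2] - x[1])
    bonded = 0.5 * K_BOND * (r01 - R0) ** 2
    nonbonded = 4.0 * EPS * ((SIGMA / r12) ** 12 - (SIGMA / r12) ** 6)
    return bonded + nonbonded


def forces(x):
    """Force on each atom: the negative gradient of potential_energy."""
    f = np.zeros_like(x)
    d01, d12 = x[1] - x[0], x[2] - x[1]
    r01, r12 = abs(d01), abs(d12)
    du_bond = K_BOND * (r01 - R0)  # dU/dr for the bond
    du_lj = 4.0 * EPS * (-12.0 * SIGMA**12 / r12**13 + 6.0 * SIGMA**6 / r12**7)
    f[0] += du_bond * np.sign(d01)
    f[1] += -du_bond * np.sign(d01) + du_lj * np.sign(d12)
    f[2] += -du_lj * np.sign(d12)
    return f


def total_energy(x, v, m):
    """Kinetic plus potential energy; approximately conserved by the integrator."""
    return 0.5 * np.sum(m * v**2) + potential_energy(x)


def velocity_verlet(x, v, m, dt, n_steps):
    """Integrate Newton's equations of motion and return the trajectory."""
    x, v = x.copy(), v.copy()
    f = forces(x)
    traj = [x.copy()]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / m   # half-step velocity update
        x = x + dt * v_half             # full-step position update
        f = forces(x)                   # forces at the new positions
        v = v_half + 0.5 * dt * f / m   # complete the velocity update
        traj.append(x.copy())
    return np.array(traj), v


x0 = np.array([0.0, 1.1, 2.3])
v0 = np.zeros(3)
m = np.ones(3)
traj, v_end = velocity_verlet(x0, v0, m, dt=1e-3, n_steps=1000)
e_start = total_energy(x0, v0, m)
e_end = total_energy(traj[-1], v_end, m)

# Number of 1 fs steps needed to cover 1 ms of simulated time.
steps_per_millisecond = round(1e-3 / 1e-15)
```

The last line makes the scale gap explicit: at a femtosecond time step, a single millisecond of simulated time requires on the order of 10^12 force evaluations, each of which produces a frame that may need to be stored and analyzed.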

[1]  Oliver F. Lange, et al. Full correlation analysis of conformational protein dynamics, 2007, Proteins.

[2]  Arvind Ramanathan, et al. Quasi-Anharmonic Analysis Reveals Intermediate States in the Nuclear Co-Activator Receptor Binding Domain Ensemble, 2012, Pacific Symposium on Biocomputing.

[3]  Stefano Piana, et al. Automated Event Detection and Activity Monitoring in Long Molecular Dynamics Simulations, 2009, Journal of chemical theory and computation.

[4]  Oliver Beckstein, et al. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations, 2011, J. Comput. Chem.

[5]  D. Kern, et al. Hidden alternate structures of proline isomerase essential for catalysis, 2010.

[6]  Jean-François Cardoso. High-Order Contrasts for Independent Component Analysis, 1999, Neural Computation.

[7]  John L. Klepeis, et al. Millisecond-scale molecular dynamics simulations on Anton, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[8]  J. D. Morgan, et al. Molecular dynamics of ferrocytochrome c. Magnitude and anisotropy of atomic displacements, 1981, Journal of molecular biology.

[9]  Vijay S Pande, et al. Progress and challenges in the automated construction of Markov state models for full protein systems, 2009, The Journal of chemical physics.

[10]  Andrej J. Savol, et al. Event detection and sub-state discovery from biomolecular simulations using higher-order statistics: Application to enzyme adenylate kinase, 2012, Proteins.

[11]  James Andrew McCammon, et al. Accessing a Hidden Conformation of the Maltose Binding Protein Using Accelerated Molecular Dynamics, 2011, PLoS Comput. Biol.

[12]  Arvind Ramanathan, et al. On-the-Fly Identification of Conformational Substates from Molecular Dynamics Simulations, 2011, Journal of chemical theory and computation.

[13]  R Dustin Schaeffer, et al. Dynameomics: a comprehensive database of protein dynamics, 2010, Structure.

[14]  Jianyin Shao, et al. Clustering Molecular Dynamics Trajectories: 1. Characterizing the Performance of Different Clustering Algorithms, 2007, Journal of chemical theory and computation.

[15]  L. Kay, et al. Transiently populated intermediate functions as a branching point of the FF domain folding pathway, 2012, Proceedings of the National Academy of Sciences.

[16]  Dan Brandt, et al. Investigation of GPGPU for use in processing of EEG in real-time, 2010.

[17]  Arvind Ramanathan, et al. Discovering Conformational Sub-States Relevant to Protein Function, 2011, PloS one.

[18]  John L. Klepeis, et al. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.