Identifying localized changes in large systems: Change-point detection for biomolecular simulations

Significance

With data proliferating across many disciplines, data analysts often wish to identify abrupt changes in complex systems with many measured quantities, a problem complicated by the fact that a given change might affect only a few of these quantities. We developed a method that accurately identifies changes in such systems by searching concurrently for change times and the subset of measured quantities that change at each of these times. This work was motivated by the challenge of detecting biologically interesting structural changes in proteins, but our method may prove useful in diverse application domains.

Research on change-point detection, the classical problem of detecting abrupt changes in sequential data, has focused predominantly on datasets with a single observable. A growing number of time series datasets, however, involve many observables, often with the property that a given change typically affects only a few of the observables. We introduce a general statistical method that, given many noisy observables, detects points in time at which various subsets of the observables exhibit simultaneous changes in data distribution and explicitly identifies those subsets. Our work is motivated by the problem of identifying the nature and timing of biologically interesting conformational changes that occur during atomic-level simulations of biomolecules such as proteins. This problem has proved challenging both because each such conformational change might involve only a small region of the molecule and because these changes are often subtle relative to the ever-present background of faster structural fluctuations. We show that our method is effective in detecting biologically interesting conformational changes in molecular dynamics simulations of both folded and unfolded proteins, even in cases where these changes are difficult to detect using alternative techniques. This method may also facilitate the detection of change points in other types of sequential data involving large numbers of observables, a problem likely to become increasingly important as such data continue to proliferate in a variety of application domains.
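The abstract does not spell out the algorithm, so the following is only a toy illustration of the problem setting, not the authors' method. It assumes a single change point and a simple mean shift, and it swaps in a thresholded two-sample z-statistic as the detection rule: for each candidate time, only observables whose statistic exceeds a threshold contribute to the score, so the scan returns both a change time and the subset of observables that appear to change there. The function names, the threshold of 4.0, and the synthetic data generator are all hypothetical choices made for this sketch.

```python
import math
import random


def make_toy_data(n_obs=6, length=200, change_at=120, changed=(1, 4),
                  shift=3.0, seed=0):
    """Hypothetical toy data: unit-variance Gaussian noise, with a mean
    shift added to the `changed` observables from `change_at` onward."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0)
         + (shift if (j in changed and t >= change_at) else 0.0)
         for t in range(length)]
        for j in range(n_obs)
    ]


def shift_z_score(x, t):
    """Absolute two-sample z-statistic for a mean shift in x at index t."""
    left, right = x[:t], x[t:]
    mean_l = sum(left) / len(left)
    mean_r = sum(right) / len(right)
    # Pooled within-segment sum of squares around each segment mean.
    ss = (sum((v - mean_l) ** 2 for v in left)
          + sum((v - mean_r) ** 2 for v in right))
    sd = math.sqrt(ss / (len(x) - 2))
    if sd == 0.0:
        # Degenerate noise-free data; this sketch treats it as no evidence.
        return 0.0
    return abs(mean_r - mean_l) / (sd * math.sqrt(1 / len(left) + 1 / len(right)))


def scan_single_change(series, min_seg=10, threshold=4.0):
    """Scan candidate times; return (time, subset of observable indices).

    An observable joins the subset at time t only if its z-statistic
    exceeds `threshold`; the score sums the excess of z^2 over
    threshold^2, which acts as a per-observable penalty (an assumption of
    this sketch). Returns (None, []) if nothing exceeds the threshold.
    """
    length = len(series[0])
    best_time, best_score, best_subset = None, 0.0, []
    for t in range(min_seg, length - min_seg + 1):
        score, subset = 0.0, []
        for j, x in enumerate(series):
            z = shift_z_score(x, t)
            if z > threshold:
                score += z * z - threshold * threshold
                subset.append(j)
        if score > best_score:
            best_time, best_score, best_subset = t, score, subset
    return best_time, best_subset


if __name__ == "__main__":
    data = make_toy_data()
    time_index, subset = scan_single_change(data)
    print("estimated change time:", time_index, "changed observables:", subset)
```

The per-observable threshold is what keeps unchanged observables from diluting the evidence: summing statistics over all observables would bury a change that affects only two of six. A realistic treatment would also need to handle multiple change points, changes in distribution beyond the mean, and the temporal correlation typical of simulation data, none of which this sketch attempts.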
