Statistical Mechanics of Transcription-Factor Binding Site Discovery Using Hidden Markov Models

Hidden Markov Models (HMMs) are a commonly used tool for inference of transcription factor (TF) binding sites from DNA sequence data. We exploit the mathematical equivalence between HMMs for TF binding and the “inverse” statistical mechanics of hard rods in a one-dimensional disordered potential to investigate learning in HMMs. We derive analytic expressions for the Fisher information, a commonly employed measure of confidence in learned parameters, in the biologically relevant limit where the density of binding sites is low. We then use techniques from statistical mechanics to derive a scaling principle relating the specificity (binding energy) of a TF to the minimum amount of training data necessary to learn it.
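The hard-rod picture can be made concrete with a short sketch. The code below is illustrative only and is not taken from the paper: it assumes bound TFs are non-overlapping rods of length L, that the binding energy of a site is a sum of per-position terms in units of k_B T, and that a single log-fugacity sets the cost of placing a rod. The names `site_energy`, `log_partition`, `energy_matrix`, and `log_fugacity` are hypothetical. Under these assumptions the partition function obeys a transfer recursion of the same form as the HMM forward algorithm.

```python
import math

def site_energy(seq, start, energy_matrix):
    """Binding energy (units of k_B T) of a rod whose left end sits at `start`.

    `energy_matrix` is a list of dicts, one per motif position, mapping a
    base to its energy contribution (an assumed additive energy model).
    """
    return sum(energy_matrix[j][seq[start + j]] for j in range(len(energy_matrix)))

def log_partition(seq, energy_matrix, log_fugacity):
    """Log partition function of non-overlapping rods on `seq`.

    With Z[i] the partition function of the first i bases, either base i is
    unbound or a rod ends on it:
        Z[i] = Z[i-1] + exp(log_fugacity - E(i-L)) * Z[i-L]
    The recursion is carried out in log space for numerical stability.
    """
    L = len(energy_matrix)
    n = len(seq)
    log_z = [0.0] * (n + 1)  # Z[0] = 1: the empty configuration
    for i in range(1, n + 1):
        terms = [log_z[i - 1]]  # base i is unbound
        if i >= L:
            # a rod occupies bases i-L+1 .. i (1-based), start index i-L (0-based)
            e = site_energy(seq, i - L, energy_matrix)
            terms.append(log_z[i - L] + log_fugacity - e)
        m = max(terms)
        log_z[i] = m + math.log(sum(math.exp(t - m) for t in terms))
    return log_z[n]
```

In the low-density limit discussed in the abstract, the fugacity is small and the sum is dominated by configurations with zero or one rod. A quantity such as the Fisher information could then be estimated numerically from second derivatives of this log partition function with respect to the energy parameters, although the paper derives it analytically.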
