A Maximum Entropy Formalism for Disentangling Chains of Correlated Sequence Positions

Covariation analysis of aligned protein sequences succeeds in some instances in elucidating structural and functional links, but in general, pairs of sites with highly covarying mutations do not necessarily correspond to sites that are spatially close in the protein structure. In contrast, covariation analysis of aligned RNA sequences is relatively successful in elucidating RNA secondary structure, as well as some aspects of tertiary structure. The goals of this paper are to (1) present the problem of disentangling chains of correlated sequence positions, (2) develop the mathematical formalism for solving it, and (3) validate the resulting algorithms on simulated data. Extensive application to biological sequences will be presented elsewhere.