Statistical properties of interaction parameter estimates in direct coupling analysis

We consider the statistical properties of interaction parameter estimates obtained by the direct coupling analysis (DCA) approach to learning interactions from large data sets. Assuming that the data are generated from a random background distribution, we determine the distribution of the inferred interactions. Two inference methods are considered: the L2-regularized naive mean-field inference procedure (regularized least squares, RLS) and pseudo-likelihood maximization (plmDCA). For RLS we also study a model in which the data matrix elements are real numbers drawn independently from an identical Gaussian distribution; in this setting we find analytically that the distribution of the inferred interactions is Gaussian. For data of Boolean type, which are more realistic in practice, the inferred interactions do not generally follow a Gaussian distribution. However, extensive numerical simulations indicate that, after normalization by the standard deviation, their distribution can be characterized by a single function determined by a few system parameters. This property holds for both RLS and plmDCA and may be exploitable for inferring the distribution of extremely large interactions from simulations of smaller systems.
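As a rough illustration of the Gaussian-data setting described above, the following Python sketch draws an i.i.d. Gaussian data matrix, infers couplings, and normalizes them by their standard deviation. It is a minimal sketch under stated assumptions, not the procedure used in this work: the coupling estimate is taken to be the negative of a ridge-regularized inverse of the empirical covariance matrix, which is one common form of L2-regularized naive mean-field inference, and the function name `rls_couplings`, the regularization strength `eta`, and the sizes `M` and `N` are all hypothetical choices made for illustration.

```python
import numpy as np


def rls_couplings(X, eta):
    """Return off-diagonal coupling estimates for a data matrix X (samples x variables).

    Assumed form (not taken from the text): J = -(C + eta * I)^{-1},
    where C is the empirical covariance of the columns of X.
    """
    M, N = X.shape
    Xc = X - X.mean(axis=0)          # center each variable
    C = Xc.T @ Xc / M                # empirical covariance matrix (N x N)
    J = -np.linalg.inv(C + eta * np.eye(N))
    iu = np.triu_indices(N, k=1)     # strictly upper triangle: one entry per pair
    return J[iu]


# Hypothetical sizes: M samples of N real-valued variables, i.i.d. standard Gaussian.
M, N = 200, 50
rng = np.random.default_rng(0)
X = rng.standard_normal((M, N))

J = rls_couplings(X, eta=0.1)

# Normalize by the standard deviation so that distributions obtained
# for different system sizes can be compared on a common scale.
z = J / J.std()
```

A histogram of `z` computed for several values of `M` and `N` could then be compared with a standard Gaussian density; repeating the experiment with Boolean data would require replacing the data-generation line and, for plmDCA, a different estimator.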