Inverse finite-size scaling for high-dimensional significance analysis.

We propose an efficient procedure, termed inverse finite-size scaling (IFSS), for determining significance in high-dimensional dependence learning by surrogate data testing. IFSS rests on our discovery of a universal scaling property of random matrices, which allows signal behavior to be inferred from surrogate data of much lower dimensionality than the original data. As a motivating example, we demonstrate the procedure on ultra-high-dimensional Potts models with on the order of 10^{10} parameters. IFSS reduces the computational cost of surrogate data testing by several orders of magnitude, which makes the procedure practical for problems of this size. The approach thus holds considerable potential for generalization to other types of complex models.
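To make the notion of surrogate data testing concrete, the following is a minimal sketch of a generic permutation-based surrogate test, not the IFSS procedure itself. It assumes continuous data, uses the largest absolute pairwise correlation as the test statistic, and builds surrogates by shuffling each column independently; the function names, the statistic, and the 0.99 quantile are illustrative choices that do not come from the paper. Running such a test at full dimensionality is the kind of cost that IFSS is meant to reduce.

```python
import numpy as np


def max_abs_offdiag_corr(data):
    """Largest absolute correlation between two distinct columns."""
    corr = np.corrcoef(data, rowvar=False)
    np.fill_diagonal(corr, 0.0)  # ignore the trivial self-correlations
    return np.abs(corr).max()


def surrogate_threshold(data, n_surrogates=200, quantile=0.99, seed=0):
    """Significance threshold estimated from column-shuffled surrogates.

    Shuffling each column independently keeps every marginal distribution
    but destroys dependence between columns, so the statistic computed on
    a surrogate is a draw from the "no dependence" null distribution.
    """
    rng = np.random.default_rng(seed)
    null_stats = np.empty(n_surrogates)
    for s in range(n_surrogates):
        surrogate = rng.permuted(data, axis=0)  # per-column shuffle
        null_stats[s] = max_abs_offdiag_corr(surrogate)
    return np.quantile(null_stats, quantile)


# Toy data (hypothetical): 20 variables, of which only columns 0 and 1 are coupled.
rng = np.random.default_rng(1)
samples = rng.normal(size=(500, 20))
samples[:, 1] = samples[:, 0] + 0.1 * rng.normal(size=500)

threshold = surrogate_threshold(samples)
observed = max_abs_offdiag_corr(samples)
is_significant = observed > threshold
```

In this toy setting the coupled pair should exceed the surrogate threshold by a wide margin. The sketch recomputes the full correlation matrix for every surrogate, which is exactly what becomes prohibitive when the number of parameters is very large.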
