Centroid estimation in discrete high-dimensional spaces with applications in biology

Maximum likelihood estimators and other direct optimization-based estimators dominated statistical estimation and prediction for decades. Yet the principled foundations supporting their dominance do not apply to the discrete high-dimensional inference problems of the 21st century. As is well known, statistical decision theory shows that maximum likelihood and related estimators use data only to identify the single most probable solution. Accordingly, unless this one solution so dominates the immense ensemble of all solutions that its probability is near one, there is no principled reason to expect such an estimator to be representative of the posterior-weighted ensemble of solutions, and thus to represent the inferences drawn from the data. We employ statistical decision theory to find more representative estimators, centroid estimators, in a general high-dimensional discrete setting by using a family of loss functions whose penalties increase with the number of components in which two solutions differ. We show that centroid estimates are obtained by maximizing the marginal probabilities of the solution components, both for unconstrained ensembles and for an important class of problems whose ensembles contain exclusivity constraints, including sequence alignment and the prediction of RNA secondary structure. Three genomics examples show that these estimators substantially improve predictions of ground-truth reference sets.
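The contrast between the most probable solution and the centroid can be illustrated with a minimal sketch. It assumes the simplest member of the loss family, plain Hamming loss, on an unconstrained ensemble of binary vectors; in that case maximizing component marginals reduces to setting each component to 1 when its marginal probability exceeds 0.5. The function names, the toy ensemble, and its weights are hypothetical and are not taken from the paper, and the exclusivity-constrained case (alignment, RNA structure) is not covered.

```python
from collections import Counter


def component_marginals(samples, weights):
    """Posterior marginal probability that each binary component equals 1."""
    total = sum(weights)
    dim = len(samples[0])
    return [
        sum(w * s[i] for s, w in zip(samples, weights)) / total
        for i in range(dim)
    ]


def centroid_estimate(samples, weights):
    """Centroid under Hamming loss (assumed loss): threshold marginals at 0.5."""
    return tuple(1 if p > 0.5 else 0 for p in component_marginals(samples, weights))


def most_probable_solution(samples, weights):
    """The single solution carrying the largest total posterior weight."""
    mass = Counter()
    for s, w in zip(samples, weights):
        mass[tuple(s)] += w
    return max(mass, key=mass.get)


def expected_hamming_loss(estimate, samples, weights):
    """Posterior-weighted average number of components that differ."""
    total = sum(weights)
    return sum(
        w * sum(a != b for a, b in zip(estimate, s))
        for s, w in zip(samples, weights)
    ) / total


# Hypothetical toy ensemble: three solutions with posterior weights.
ensemble = [(1, 1, 0), (0, 0, 1), (0, 1, 1)]
posterior = [0.4, 0.3, 0.3]

mode = most_probable_solution(ensemble, posterior)
centroid = centroid_estimate(ensemble, posterior)
print("most probable:", mode, expected_hamming_loss(mode, ensemble, posterior))
print("centroid:     ", centroid, expected_hamming_loss(centroid, ensemble, posterior))
```

In this toy ensemble the most probable solution holds only 40% of the posterior mass, so it differs from the centroid, and the centroid has the lower expected Hamming loss. When one solution carries nearly all of the mass, the two estimates coincide.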
