Geometry of maximum likelihood estimation in Gaussian graphical models

Author(s): Uhler, Caroline | Advisor(s): Sturmfels, Bernd

Abstract: Algebraic statistics uses algebraic techniques to develop new paradigms and algorithms for data analysis. The development of computational algebra software provides a powerful tool for analyzing statistical models, and algebraic methods have proven useful for statistical theory and applications alike. In Part I of this thesis, we use methods from computational algebra and algebraic geometry to study Gaussian graphical models. In Part II, we describe a particular application to computational biology.

Part I investigates geometric aspects of maximum likelihood estimation in Gaussian graphical models. More generally, we study multivariate normal models that are described by linear constraints on the inverse of the covariance matrix. Maximum likelihood estimation for such models leads to the problem of maximizing the determinant function over a spectrahedron, and to the problem of characterizing the image of the positive definite cone under an arbitrary linear projection. In Chapter 2, we examine these problems at the interface of statistics and optimization from the perspective of convex algebraic geometry, and we characterize the cone of all sufficient statistics for which the maximum likelihood estimator (MLE) exists. In Chapter 3, we develop an algebraic elimination criterion that allows us to find exact lower bounds on the number of observations needed to ensure that the MLE exists with probability one. This criterion is applied to bipartite graphs, grids, and colored graphs. We also present the first instance of a graph for which the MLE exists with probability one even when the number of observations equals the treewidth. Computational algebra software can be used to study graphs with a limited number of vertices and edges. In Chapter 4, we therefore study the existence of the MLE from an asymptotic point of view, fixing a class of graphs and letting the number of vertices grow to infinity. We prove that for very large cycles, two observations already suffice for the MLE to exist with probability one.

Part II describes an application of algebraic statistics to association studies. Rapid progress in genotyping techniques has made large genome-wide association studies possible. Existing methods often focus on determining associations between single loci and a specific phenotype. However, a particular phenotype is usually the result of complex relationships between multiple loci and the environment. We develop a method for finding interacting genes (i.e., epistasis) using Markov bases. We test our method on simulated data and compare it to a two-stage logistic regression method and to a fully Bayesian method, showing that we are able to detect the interacting loci when the other methods fail to do so. Finally, we apply our method to a genome-wide dog data set and identify epistasis associated with canine hair length.
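The estimation problem from Part I can be made concrete with a small numerical sketch. For a graph G and sample covariance S, the Gaussian graphical model MLE maximizes log det K - tr(SK) over positive definite concentration matrices K with K_ij = 0 for every non-edge of G. The Python code below computes it for the 4-cycle using iterative proportional scaling, a standard algorithm chosen here for illustration; it is not claimed to be the method used in the thesis. The function name `ips_mle`, the sample size, and the iteration count are all assumptions.

```python
import numpy as np

def ips_mle(S, cliques, n_iter=200):
    """Iterative proportional scaling for a Gaussian graphical model.

    S       : (p, p) sample covariance matrix
    cliques : list of vertex lists covering every edge of the graph
    Returns a concentration matrix K whose non-edge entries stay zero.
    """
    p = S.shape[0]
    K = np.eye(p)  # the identity satisfies all zero constraints
    for _ in range(n_iter):
        for C in cliques:
            C = list(C)
            R = [i for i in range(p) if i not in C]
            # Schur-complement term contributed by the remaining vertices
            adj = K[np.ix_(C, R)] @ np.linalg.solve(K[np.ix_(R, R)],
                                                    K[np.ix_(R, C)])
            # After this update, the C-block of inv(K) equals S[C, C]
            K[np.ix_(C, C)] = np.linalg.inv(S[np.ix_(C, C)]) + adj
    return K

# 4-cycle 0-1-2-3-0: the edges are the cliques; (0,2) and (1,3) are non-edges
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))   # 50 observations, mean assumed known (zero)
S = X.T @ X / X.shape[0]

K_hat = ips_mle(S, edges)
Sigma_hat = np.linalg.inv(K_hat)
print(np.round(K_hat, 3))
```

At the optimum, `Sigma_hat` agrees with `S` on the diagonal and on the edges of the graph, while `K_hat` is zero on the non-edges. With 50 observations the sample covariance is positive definite, so existence of the MLE is not in question; the thesis concerns the much harder regime where the number of observations is small relative to the number of vertices.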
