Algebraic Statistics

Basic data-driven scientific discovery rests on our ability to answer the first question one might ask a statistician: What model best fits the observed data? At the center of this question is a fundamental statistical problem: hypothesis testing. A test consists of proposing a model, calculating a test statistic, and using its theoretical properties to decide whether to reject the model hypothesis. Traditional tests assume what are usually referred to as mild conditions: large sample size and model smoothness or regularity. On the other hand, the development of data-acquisition technologies has produced many types of data sets, such as social networks or food webs, which may be available only as very sparse, possibly large, single-sample data. In turn, this has motivated the development of new statistical models that can capture the rich structure of such data but are necessarily more complex and, in fact, non-regular. For such data and models, both the regularity and the large-sample assumptions are violated. Data-driven science is then faced with two obvious obstacles. First, if only a single observation (or a small sample) of, say, a network is available, the well-known hypothesis tests relying on asymptotic methods do not apply. How does a social scientist test how cities grow and evolve as networks if network model-testing tools are not available for her type of data and hypotheses? Second, if the genetic mutation models used in biology are sufficiently rich, their parameters can be shown to be non-identifiable, so no theory justifies estimating them with hill-climbing algorithms, which can lead to incorrect results. How does a researcher in phylogenetics know whether she has correctly determined the genetic relationship between species from their DNA sequences if there is no theory to show that the computational method is reliable?
My research in algebraic statistics addresses two fundamental problems: how to extend the hypothesis testing methodology to sparse categorical data, and how to bypass parameter estimation issues such as non-identifiability or multimodal likelihood functions. Both translate to algebraic, geometric, and combinatorial complexity properties of statistical models. My main, broad research objective is the integration of the fields of statistics, algebraic geometry, and combinatorics, with a focus on providing a formalism for an interdisciplinary approach to data analysis and modeling. I am currently working on random graphs and network modeling with the goal of understanding what social-network-type data are telling us about the world around us.