High Dimensional Modelling

This chapter describes methods suitable for high-dimensional graphical modeling. Recent years have seen intense interest in applying graphical modeling techniques to data of high dimension, by which we mean from hundreds to tens of thousands of variables. Such data arise routinely in fields such as molecular biology. We first describe two typical datasets: one from a study of gene expression in breast cancer patients, and the other from the HapMap project, in which a large number of genomic markers and gene expression measurements are recorded for 90 individuals. We then compare the computational efficiency of several model selection algorithms, as applied to one of the example datasets. Of these, an extension of the Chow-Liu algorithm that finds the minimal BIC forest, implemented in the gRapHD package, is found to be the most efficient. The glasso algorithm and a stepwise decomposable search algorithm are also highly efficient. We describe these algorithms in more detail and illustrate their use on the example datasets. Finally, as a more advanced example, we illustrate how a Bayesian equivalent of the minimal BIC forest algorithm for high-dimensional discrete data may be obtained. Assuming a hyper-Dirichlet prior, the maximum a posteriori forest is derived by using the extended Chow-Liu algorithm with appropriate user-defined edge weights. This is illustrated using a subset of the HapMap data.
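To make the idea of a minimal BIC forest concrete, the following is a minimal sketch in Python for discrete data. It is not the gRapHD implementation, and the function names (`mutual_information`, `bic_edge_weight`, `min_bic_forest`) are hypothetical. It assumes the usual BIC-penalized edge weight for a pair of discrete variables, namely 2n times the empirical mutual information minus log(n) times the degrees of freedom of the pairwise interaction, and keeps only edges with positive weight before running a Kruskal-style maximum-weight spanning forest search.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (cx[a] * cy[b]))
        for (a, b), c in cxy.items()
    )


def bic_edge_weight(x, y):
    """Assumed BIC-penalized weight: 2*n*MI - log(n)*(levels_x - 1)*(levels_y - 1)."""
    n = len(x)
    df = (len(set(x)) - 1) * (len(set(y)) - 1)
    return 2 * n * mutual_information(x, y) - math.log(n) * df


def min_bic_forest(columns):
    """Return the edges of a maximum-weight forest over positively weighted pairs.

    `columns` maps a variable name to its list of observed values.
    """
    names = list(columns)
    edges = []
    for u, v in combinations(names, 2):
        w = bic_edge_weight(columns[u], columns[v])
        if w > 0:  # an edge is kept only if adding it would lower the BIC
            edges.append((w, u, v))
    edges.sort(reverse=True)  # heaviest edges first, as in Kruskal's algorithm

    parent = {v: v for v in names}  # union-find structure for cycle detection

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    forest = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # joining two different components cannot create a cycle
            parent[ru] = rv
            forest.append((u, v))
    return forest


# Toy data: b copies a, while c is empirically independent of both.
data = {
    "a": [0, 1] * 50,
    "b": [0, 1] * 50,
    "c": [0, 0, 1, 1] * 25,
}
forest = min_bic_forest(data)
```

Because edges with non-positive weight are discarded, the result may be a forest with several components rather than a single spanning tree; in the toy data, `c` is left isolated.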