Selecting high-dimensional mixed graphical models using minimal AIC or BIC forests

Background: Chow and Liu showed that the maximum likelihood tree for multivariate discrete distributions may be found using a maximum weight spanning tree algorithm, for example Kruskal's algorithm. The efficiency of the algorithm makes it tractable for high-dimensional problems.

Results: We extend Chow and Liu's approach in two ways: first, to find the forest optimizing a penalized likelihood criterion, for example AIC or BIC, and second, to handle data with both discrete and Gaussian variables. We apply the approach to three datasets: two from gene expression studies and the third from a genetics of gene expression study. The minimal BIC forest supplements a conventional analysis of differential expression by providing a tentative network for the differentially expressed genes. In the genetics of gene expression context, the method identifies a network approximating the joint distribution of the DNA markers and the gene expression levels.

Conclusions: The approach is generally useful as a preliminary step towards understanding the overall dependence structure of high-dimensional discrete and/or continuous data. Trees and forests are unrealistically simple models for biological systems, but can provide useful insights. Uses include the following: identification of distinct connected components, which can be analysed separately (dimension reduction); identification of neighbourhoods for more detailed analyses; as initial models for search algorithms with a larger search space, for example decomposable models or Bayesian networks; and identification of interesting features, such as hub nodes.
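The penalized spanning-forest idea summarized above can be illustrated with a minimal sketch for the purely discrete case. It is not the authors' implementation and does not cover Gaussian or mixed variables. It assumes that each candidate edge is weighted by the likelihood-ratio statistic for pairwise independence (twice the sample size times the empirical mutual information) minus a penalty per degree of freedom (log n for BIC, 2 for AIC), and that only edges with positive penalized weight are eligible. The names `penalized_weight` and `min_penalized_forest` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def penalized_weight(x, y, penalty):
    """Likelihood-ratio statistic for independence of x and y, minus penalty * df."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    # G^2 = 2 * n * (empirical mutual information, in nats)
    g2 = 2.0 * sum(c * math.log(c * n / (cx[a] * cy[b]))
                   for (a, b), c in cxy.items())
    df = (len(cx) - 1) * (len(cy) - 1)
    return g2 - penalty * df


def min_penalized_forest(columns, criterion="BIC"):
    """Kruskal-style search: columns maps variable name -> list of discrete values."""
    names = list(columns)
    n = len(columns[names[0]])
    penalty = math.log(n) if criterion == "BIC" else 2.0  # assumed: AIC otherwise

    # Keep only edges whose penalized weight is positive, heaviest first.
    edges = []
    for u, v in combinations(names, 2):
        w = penalized_weight(columns[u], columns[v], penalty)
        if w > 0:
            edges.append((w, u, v))
    edges.sort(reverse=True)

    # Union-find to reject edges that would close a cycle.
    parent = {v: v for v in names}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest


# Toy data (made up for illustration): b copies a, c is unrelated to both.
toy = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1] * 5,
    "b": [0, 0, 0, 0, 1, 1, 1, 1] * 5,
    "c": [0, 1] * 20,
}
print(min_penalized_forest(toy, "BIC"))
```

Because edges with non-positive penalized weight are discarded before the spanning step, the result can be a forest with several connected components rather than a single tree, which is what allows the components to be analysed separately.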
