BOOTSTRAP INFERENCE FOR NETWORK CONSTRUCTION WITH AN APPLICATION TO A BREAST CANCER MICROARRAY STUDY.

Gaussian graphical models (GGMs) are widely used to construct genetic regulatory networks. Because network inference usually falls into a high-dimension, low-sample-size setting, regularization is typically required. Yet choosing the right amount of regularization is challenging, especially in an unsupervised setting where traditional methods such as BIC or cross-validation often do not work well. In this paper, we propose a new method, Bootstrap Inference for Network COnstruction (BINCO), that infers networks by directly controlling the false discovery rate (FDR) of the selected edges. BINCO calculates edge selection frequencies via model aggregation and fits a mixture model to the distribution of those frequencies in order to estimate the FDR. The method applies to a wide range of problems beyond network construction. Applied to building a gene regulatory network from breast cancer microarray expression data, BINCO identified high-confidence edges and well-connected hub genes that may play important roles in the underlying biological processes of breast cancer.
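The selection-frequency step can be illustrated with a minimal sketch. It is not the BINCO implementation: the edge selector below is a plain threshold on absolute Pearson correlation, used only as a stand-in for a regularized GGM fit, and the function names, the threshold, the number of bootstrap resamples, and the toy data are all assumptions made for illustration. The mixture-model FDR estimate is not shown.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate resample: treat as uncorrelated
    return sxy / (sxx * syy) ** 0.5


def select_edges(rows, threshold):
    """Stand-in selector (assumption): keep edge (i, j) if |corr| > threshold."""
    p = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(p)]
    return {
        (i, j)
        for i, j in combinations(range(p), 2)
        if abs(pearson(cols[i], cols[j])) > threshold
    }


def selection_frequencies(rows, threshold=0.5, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each edge is selected."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    counts = {edge: 0 for edge in combinations(range(p), 2)}
    for _ in range(n_boot):
        # Resample observations (rows) with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        for edge in select_edges(sample, threshold):
            counts[edge] += 1
    return {edge: c / n_boot for edge, c in counts.items()}


def toy_data(n=40, seed=1):
    """Hypothetical data: variables 0 and 1 are linked, variable 2 is noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.gauss(0, 1)
        rows.append([a, a + rng.gauss(0, 0.3), rng.gauss(0, 1)])
    return rows


freqs = selection_frequencies(toy_data())
for edge, f in sorted(freqs.items()):
    print(edge, f)
```

In this toy setting the true edge (0, 1) should be selected in nearly every resample, while the two null edges should be selected rarely. The abstract's mixture model is fitted to the distribution of such frequencies across all candidate edges.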
