A Two-Stage Penalized Least Squares Method for Constructing Large Systems of Structural Equations

We propose a two-stage penalized least squares method for building large systems of structural equations, based on the instrumental-variables view of the classical two-stage least squares method. We show that, when the numbers of endogenous and exogenous variables are large, the system can be constructed by consistently estimating a set of conditional expectations at the first stage and consistently selecting regulatory effects at the second stage. Ridge regression provides the consistent estimation at the first stage, and the adaptive lasso provides the consistent selection at the second stage. The resulting estimates of regulatory effects enjoy the oracle properties. The method is computationally fast and allows for parallel implementation. We demonstrate its effectiveness through simulation studies and a real data analysis.
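
The two stages can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: a linear system Y = YΓ + XΦ + E, a known list `own_exog` of exogenous columns entering each equation directly, fixed tuning parameters, a ridge fit as the initial estimate for the adaptive weights, and a plain coordinate-descent lasso solver. All function and parameter names are hypothetical.

```python
import numpy as np

def ridge_fit_predict(X, Y, lam):
    """Stage 1: ridge-regress every endogenous column of Y on all exogenous X."""
    q = X.shape[1]
    coef = np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ Y)
    return X @ coef  # estimated conditional expectations of Y given X

def weighted_lasso_cd(Z, y, lam, weights, n_iter=500, tol=1e-8):
    """Coordinate descent for (1/2n)||y - Zb||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = Z.shape
    b = np.zeros(p)
    col_sq = (Z ** 2).sum(axis=0) / n
    r = y.copy()  # current residual y - Z @ b
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = Z[:, j] @ r / n + col_sq[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam * weights[j], 0.0) / col_sq[j]
            if new != b[j]:
                r -= Z[:, j] * (new - b[j])
                max_change = max(max_change, abs(new - b[j]))
                b[j] = new
        if max_change < tol:
            break
    return b

def two_stage_pls(Y, X, own_exog, lam_ridge=1.0, lam_lasso=0.1, gamma=1.0):
    """Return Gamma, where Gamma[j, k] is the estimated effect of Y_j on Y_k."""
    n, p = Y.shape
    Y_hat = ridge_fit_predict(X, Y, lam_ridge)
    Gamma = np.zeros((p, p))
    for k in range(p):  # equations are fitted independently of each other
        others = [j for j in range(p) if j != k]
        Xk = X[:, own_exog[k]]
        # Assumption: project out the exogenous variables of equation k.
        P = np.eye(n) - Xk @ np.linalg.pinv(Xk)
        y = P @ Y[:, k]
        Z = P @ Y_hat[:, others]
        # Assumption: adaptive weights come from an initial ridge estimate.
        init = np.linalg.solve(Z.T @ Z + lam_ridge * np.eye(p - 1), Z.T @ y)
        w = 1.0 / (np.abs(init) ** gamma + 1e-8)
        Gamma[others, k] = weighted_lasso_cd(Z, y, lam_lasso, w)
    return Gamma
```

In this sketch each pass of the loop over `k` uses only the shared first-stage predictions, so the second-stage fits could be distributed across workers. In practice the tuning parameters would be chosen by a data-driven criterion rather than fixed as they are here.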
