Linear regression and its inference on noisy network-linked data

Linear regression on observations linked by a network is an essential tool for modeling the relationship between a response and covariates when additional network data are available. Despite its wide range of applications in areas such as the social sciences and health-related research, the problem has not been well studied in statistics. Previous methods either lack inference tools or rely on restrictive assumptions on social effects, and they usually assume that networks are observed without error, which is unrealistic in many problems. In this paper, we propose a linear regression model with nonparametric network effects. Our model does not assume that the relational data or network structure is exactly observed; thus, the method is provably robust to a certain level of perturbation of the network structure. We establish a set of asymptotic inference results under a general condition on the network perturbation and then study the robustness of our method in the specific setting where the perturbation comes from random network models. We discover a phase transition in inference validity with respect to network density when no prior knowledge of the network model is available, and we show the significant improvement achieved when the network model is known. A by-product of our analysis is a rate-optimal concentration bound for subspace projections that may be of independent interest. We conduct extensive simulation studies to verify our theoretical observations and demonstrate the advantage of our method over a few benchmarks in terms of accuracy and computational efficiency under different data-generating models. The method is then applied to adolescent network data to study gender and racial differences in social activities.
