Statistical analysis of networks with community structure and bootstrap methods for big data

This dissertation is divided into two parts, each concerning a distinct area of statistical methodology. The first part concerns statistical analysis of networks with community structure. The second part concerns bootstrap methods for big data.

Statistical analysis of networks with community structure: Networks are ubiquitous in today’s world. Network data arise in varied fields such as scientific studies, sociology, technology, social media and the Internet, to name a few. An interesting aspect of many real-world networks is the presence of community structure, which raises the problem of detecting that structure. In the first chapter, we consider heterogeneous networks, which do not seem to have been considered in the statistical community detection literature. We propose a blockmodel for heterogeneous networks with community structure and introduce a heterogeneous spectral clustering algorithm for community detection in such networks. We study the theoretical properties of the clustering algorithm under the proposed model and present a simulation study and a data analysis.

A network feature that is closely associated with community structure is the popularity of nodes in different communities. Neither the classical stochastic blockmodel nor its degree-corrected extension can satisfactorily capture the dynamics of node popularity. In the second chapter, we propose a popularity-adjusted blockmodel for flexible modeling of node popularity. We establish consistency of likelihood modularity for community detection under the proposed model, and illustrate the improved empirical insights that can be gained through this methodology by analyzing the political blogs network. Here, popularity is defined as the number of edges between a specific node and a specific community.
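The definition of popularity as an edge count between a node and a community can be illustrated with a minimal sketch. The function name `node_popularity`, the toy adjacency matrix, and the community labels below are hypothetical and chosen only for illustration; they are not taken from the dissertation, and the sketch assumes an undirected, unweighted network without self-loops.

```python
def node_popularity(adj, labels, n_communities):
    """Return pop where pop[i][r] is the number of edges between
    node i and community r (illustrative helper, not the authors' code)."""
    n = len(adj)
    pop = [[0] * n_communities for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Count each neighbor j of node i under j's community label.
            if i != j and adj[i][j]:
                pop[i][labels[j]] += 1
    return pop


# Toy undirected network (assumed data): nodes 0-2 form community 0,
# nodes 3-4 form community 1.
adj = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
labels = [0, 0, 0, 1, 1]

popularity = node_popularity(adj, labels, 2)
for node, row in enumerate(popularity):
    print(node, row)
```

Each row of `popularity` sums to the degree of the corresponding node, so the table splits a node's degree across communities. Two nodes with the same degree and the same community label can still have different rows, which is the kind of variation the abstract says the classical and degree-corrected blockmodels do not capture satisfactorily.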
