Asymptotic Seed Bias in Respondent-driven Sampling

Respondent-driven sampling (RDS) collects a sample of individuals in a networked population by incentivizing the sampled individuals to refer their contacts into the sample. This iterative process is initialized from one or more seed nodes. Depending on the setting, the choice of seed induces either a large or a small amount of seed bias. This paper characterizes the effect of this bias on the limiting distribution of various RDS estimators. Using classical tools and results from multi-type branching processes (Kesten and Stigum, 1966), we show that the seed bias is negligible for the Generalized Least Squares (GLS) estimator and non-negligible for both the inverse probability weighted and Volz-Heckathorn (VH) estimators. In particular, we show that (i) above a critical threshold, the VH estimator converges to a non-trivial mixture distribution, where the mixture component depends on the seed node and the mixture distribution is possibly multi-modal, and (ii) the GLS estimator converges to a Gaussian distribution independent of the seed node, under a certain condition on the Markov process. Numerical experiments with both simulated data and empirical social networks suggest that these results hold beyond the Markov conditions of the theorems.
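The referral process and the VH estimator described above can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two-block graph, the parameter values, the with-replacement referral rule, and the function names `make_two_block_graph`, `first_connected_node`, `rds_sample`, and `vh_estimate` are all hypothetical. The VH estimate is computed here as an inverse-degree weighted mean of the outcome.

```python
import random

def make_two_block_graph(n_per_block, p_in, p_out, rng):
    """Undirected two-block stochastic blockmodel, returned as adjacency lists."""
    n = 2 * n_per_block
    block = [0] * n_per_block + [1] * n_per_block
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if block[i] == block[j] else p_out
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj, block

def first_connected_node(adj, block, b):
    """First node in block b with at least one neighbor (used as a seed)."""
    return next(i for i in range(len(adj)) if block[i] == b and adj[i])

def rds_sample(adj, seed_node, referrals, waves, rng):
    """Referral tree: every participant refers `referrals` neighbors,
    drawn uniformly with replacement (a Markov-style simplification)."""
    sample = [seed_node]
    current = [seed_node]
    for _ in range(waves):
        nxt = []
        for node in current:
            nxt.extend(rng.choice(adj[node]) for _ in range(referrals))
        sample.extend(nxt)
        current = nxt
    return sample

def vh_estimate(sample, adj, y):
    """Inverse-degree weighted mean of the outcome y over the sample."""
    weights = [1.0 / len(adj[i]) for i in sample]
    return sum(w * y[i] for w, i in zip(weights, sample)) / sum(weights)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
adj, block = make_two_block_graph(n_per_block=100, p_in=0.2, p_out=0.01, rng=rng)
y = block  # assumed outcome: membership in block 1, so the population mean is 0.5

# Repeat the sampling from a seed in each block and collect the VH estimates.
estimates = {}
for b in (0, 1):
    seed_node = first_connected_node(adj, block, b)
    estimates[b] = [
        vh_estimate(rds_sample(adj, seed_node, referrals=2, waves=7, rng=rng), adj, y)
        for _ in range(50)
    ]

for b, values in estimates.items():
    print(f"seed in block {b}: mean VH estimate = {sum(values) / len(values):.3f}")
```

Comparing the two printed averages gives an informal view of how much the estimate depends on the block of the seed in this toy setting. It is not a reproduction of the paper's experiments.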

[1]  Matthew J. Salganik, et al. Strengthening the Reporting of Observational Studies in Epidemiology for respondent-driven sampling studies: “STROBE-RDS” statement, 2015, Journal of clinical epidemiology.

[2]  G. Gurtner, et al. Statistics in medicine, 2011, Plastic and reconstructive surgery.

[3]  R. Durrett. Probability: Theory and Examples, 1993.

[4]  Yuval Peres, et al. Markov chains indexed by trees, 1994.

[5]  S. Boorman, et al. Social structure from multiple networks: I, 1976.

[6]  Karl Rohe, et al. Network driven sampling; a critical threshold for design effects, 2015, 1505.05461.

[7]  Sébastien Roch, et al. Generalized least squares can overcome the critical threshold in respondent-driven sampling, 2018, Proceedings of the National Academy of Sciences.

[8]  Tyler H McCormick, et al. Estimating uncertainty in respondent-driven sampling using a tree bootstrap method, 2016, Proceedings of the National Academy of Sciences.

[9]  Mikko Alava, et al. Branching Processes, 2009, Encyclopedia of Complexity and Systems Science.

[10]  Matthew J. Salganik, et al. Respondent‐driven sampling as Markov chain Monte Carlo, 2009, Statistics in medicine.

[11]  Kathryn B. Laskey, et al. Stochastic blockmodels: First steps, 1983.

[12]  Douglas D. Heckathorn, et al. Respondent-driven sampling: A new approach to the study of hidden populations, 1997.

[13]  Elizabeth L. Wilmer, et al. Markov Chains and Mixing Times, 2008.

[14]  Erik M. Volz, et al. Probability based estimation theory for respondent driven sampling, 2008.

[15]  Karl Rohe, et al. Central limit theorems for network driven sampling, 2015, 1509.04704.

[16]  Karl Rohe. A critical threshold for design effects in network sampling, 2019, The Annals of Statistics.

[17]  Matthew J. Salganik, et al. 5. Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling, 2004.

[18]  S. Boorman, et al. Social Structure from Multiple Networks. I. Blockmodels of Roles and Positions, 1976, American Journal of Sociology.

[19]  H. Kesten, et al. Additional Limit Theorems for Indecomposable Multidimensional Galton-Watson Processes, 1966.

[20]  S. Boorman, et al. Social Structure from Multiple Networks. II. Role Structures, 1976, American Journal of Sociology.

[21]  T. E. Harris, et al. The Theory of Branching Processes, 1963.