Parsing Social Network Survey Data from Hidden Populations Using Stochastic Context-Free Grammars

Background: Human populations are structured by social networks, in which individuals tend to form relationships based on shared attributes. Certain attributes that are ambiguous, stigmatized or illegal can create a 'hidden' population, so-called because its members are difficult to identify. Many hidden populations are also at an elevated risk of exposure to infectious diseases. Consequently, public health agencies are presently adopting modern survey techniques that traverse social networks in hidden populations by soliciting individuals to recruit their peers, e.g., respondent-driven sampling (RDS). The concomitant accumulation of network-based epidemiological data, however, is rapidly outpacing the development of computational methods for analysis. Moreover, current analytical models rely on unrealistic assumptions, e.g., that the traversal of social networks can be modeled by a Markov chain rather than a branching process.

Methodology/Principal Findings: Here, we develop a new methodology based on stochastic context-free grammars (SCFGs), which are well suited to modeling the tree-like structure of the RDS recruitment process. We apply this methodology to an RDS case study of injection drug users (IDUs) in Tijuana, México, a hidden population at high risk of blood-borne and sexually transmitted infections (i.e., HIV, hepatitis C virus, syphilis). Survey data were encoded as text strings that were parsed using our custom implementation of the inside-outside algorithm in a publicly available software package (HyPhy), which uses either expectation maximization or direct optimization methods and permits constraints on model parameters for hypothesis testing. We identified significant latent variability in the recruitment process that violates assumptions of Markov chain-based methods for RDS analysis: first, IDUs tended to emulate the recruitment behavior of their own recruiter; and second, the recruitment of like peers (homophily) was dependent on the number of recruits.

Conclusions: SCFGs provide a rich probabilistic language that can articulate complex latent structure in survey data derived from the traversal of social networks. Such structure, which has no representation in Markov chain-based models, can interfere with the estimation of the composition of hidden populations if left unaccounted for, raising critical implications for the prevention and control of infectious disease epidemics.
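As a rough illustration of the parsing step described above, the following Python sketch computes the probability of a string under an SCFG using the inside pass of the inside-outside algorithm. It is not the HyPhy implementation. The function name `inside_probability`, the toy grammar, and the encoding of respondents as the symbols `a` and `b` are all hypothetical and chosen only for illustration. The grammar is assumed to be in Chomsky normal form.

```python
from collections import defaultdict


def inside_probability(tokens, binary_rules, terminal_rules, start="S"):
    """Return P(tokens | grammar) for an SCFG in Chomsky normal form.

    binary_rules maps (A, B, C) to the probability of the rule A -> B C.
    terminal_rules maps (A, a) to the probability of the rule A -> a.
    """
    n = len(tokens)
    if n == 0:
        return 0.0

    # inside[i][j][A] is the probability that A derives tokens[i..j] inclusive.
    inside = [[defaultdict(float) for _ in range(n)] for _ in range(n)]

    # Base case: spans of length one are produced by terminal rules.
    for i, tok in enumerate(tokens):
        for (lhs, term), p in terminal_rules.items():
            if term == tok:
                inside[i][i][lhs] += p

    # Recursion: combine two adjacent sub-spans with a binary rule,
    # summing over every split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                left, right = inside[i][k], inside[k + 1][j]
                for (lhs, b, c), p in binary_rules.items():
                    if b in left and c in right:
                        inside[i][j][lhs] += p * left[b] * right[c]

    return inside[0][n - 1].get(start, 0.0)


# Hypothetical toy grammar (assumption, not taken from the paper):
#   S -> S S (0.3) | 'a' (0.4) | 'b' (0.3)
TOY_BINARY = {("S", "S", "S"): 0.3}
TOY_TERMINAL = {("S", "a"): 0.4, ("S", "b"): 0.3}

p_ab = inside_probability("ab", TOY_BINARY, TOY_TERMINAL)
p_aba = inside_probability("aba", TOY_BINARY, TOY_TERMINAL)
```

Under this toy grammar, `ab` has a single parse with probability 0.3 × 0.4 × 0.3 = 0.036, while `aba` has two parses whose probabilities sum to 0.00864. Summing over all parses in this way is what allows an SCFG to assign a likelihood to a branching structure, which a Markov chain over a single sequence of states cannot express.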
