Shrinking a large dataset to identify variables associated with increased risk of Plasmodium falciparum infection in Western Kenya

SUMMARY Large datasets are often not amenable to analysis using traditional single-step approaches. Our general objective was to apply imputation techniques, principal component analysis (PCA), elastic net, and generalized linear models in a systematic sequence to extract the most meaningful predictors of a health outcome from a large dataset. To demonstrate these techniques, we extracted predictors of Plasmodium falciparum infection from a dataset with many covariates but a limited number of observations, using data from the People, Animals, and their Zoonoses (PAZ) project. The data, collected from 415 homesteads in western Kenya, contained over 1500 variables describing the health, environment, and social factors of the humans, the livestock, and the homesteads in which they reside. The wide, sparse dataset was reduced to 42 predictors of P. falciparum malaria infection, and wealth rankings were produced for all homesteads. The 42 predictors make biological sense and are supported by previous studies. This systematic data-mining approach could make many large datasets more manageable and more informative for decision-making and health policy prioritization.
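
To illustrate how an elastic net penalty can shrink a set of candidate predictors, the following is a minimal sketch in plain Python, not the authors' code. It makes several assumptions that are not in the summary: a squared-error (Gaussian) loss is used instead of a model for a binary infection outcome, the predictors are an invented orthogonal toy design that is already centred and scaled, and the function names and the values of `lam` and `alpha` are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(columns, y, lam, alpha, n_iter=100):
    """Coordinate descent for the penalised least-squares objective
        (1/2n) * ||y - b0 - X b||^2
        + lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).
    `columns` is a list of predictor columns, each assumed to be centred,
    so the intercept b0 is simply the mean of y."""
    n = len(y)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]          # residuals with all betas at 0
    beta = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, xj in enumerate(columns):
            # Covariance of predictor j with the partial residual
            # (the residual with predictor j's own contribution added back).
            rho = sum(xj[i] * (resid[i] + beta[j] * xj[i]) for i in range(n)) / n
            denom = sum(v * v for v in xj) / n + lam * (1.0 - alpha)
            new_bj = soft_threshold(rho, lam * alpha) / denom
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [resid[i] - delta * xj[i] for i in range(n)]
                beta[j] = new_bj
    return beta


# Hypothetical toy data: 8 observations, 7 mutually orthogonal +/-1 predictors
# (Walsh functions), each with mean 0 and variance 1.
columns = [[(-1) ** bin(i & k).count("1") for i in range(8)] for k in range(1, 8)]

# The outcome depends strongly on predictors 0 and 1 and very weakly on 2.
y = [5.0 + 2.0 * columns[0][i] - 1.5 * columns[1][i] + 0.02 * columns[2][i]
     for i in range(8)]

beta = elastic_net(columns, y, lam=0.1, alpha=0.5)

# Predictors whose coefficient was not shrunk to exactly zero.
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(selected)
```

In this toy case only the two strong predictors keep nonzero coefficients, while the weak and irrelevant ones are set to exactly zero. In a workflow like the one the summary describes, the retained variables would then be carried forward to a generalized linear model; a larger `lam` would remove more variables, and `alpha` controls the balance between the lasso and ridge parts of the penalty.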
