An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria

Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-Euclidean data mining tool, HyperCube®, to exhaustively explore a complex epidemiological malaria data set by searching for over-density of events in m-dimensional space. Hotspots of over-density correspond to strings of variables (rules) that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992–2003, aged 1–5 years, having hemoglobin AA, and having had ≤10 previous Plasmodium malariae infections. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared the HyperCube® rule with rules built from variables identified by traditional statistical methods and by non-parametric regression tree methods. In addition, we tried all possible sub-stratifications of the quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of the number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks that are difficult to perform using standard data mining methods. Searching for local over-density in m-dimensional space, explained by easily interpretable rules, thus appears well suited to generating hypotheses from large data sets and to unraveling the complexity inherent in biological systems.
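To make the idea of a rule and its Relative Risk concrete, the following Python sketch enumerates every conjunction of variable values in a tiny data set and keeps the one with the highest Relative Risk among those meeting a minimum support. It is not the HyperCube® algorithm, whose internals are not described here. The records are synthetic, the names `RECORDS`, `relative_risk`, `best_rule` and `min_support` are hypothetical, and Relative Risk is assumed to mean the event rate inside the rule divided by the event rate in the whole population.

```python
from itertools import combinations, product

# Synthetic toy records (NOT from the study): one row per observation.
RECORDS = [
    {"age": "1-5", "hb": "AA", "episode": 1},
    {"age": "1-5", "hb": "AA", "episode": 1},
    {"age": "1-5", "hb": "AS", "episode": 0},
    {"age": "1-5", "hb": "AS", "episode": 1},
    {"age": "6+", "hb": "AA", "episode": 0},
    {"age": "6+", "hb": "AA", "episode": 1},
    {"age": "6+", "hb": "AS", "episode": 0},
    {"age": "6+", "hb": "AS", "episode": 0},
]


def event_rate(rows):
    """Fraction of rows in which the outcome event occurred."""
    return sum(r["episode"] for r in rows) / len(rows)


def relative_risk(rows, rule):
    """Return (relative risk, support) of a rule given as {variable: value}.

    Assumed definition: event rate inside the rule divided by the
    event rate of the whole population.
    """
    inside = [r for r in rows if all(r[k] == v for k, v in rule.items())]
    if not inside:
        return 0.0, 0
    return event_rate(inside) / event_rate(rows), len(inside)


def best_rule(rows, variables, min_support):
    """Brute-force search over all conjunctions of variable values."""
    best = (0.0, {}, 0)
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            levels = [sorted({r[v] for r in rows}) for v in subset]
            for combo in product(*levels):
                rule = dict(zip(subset, combo))
                rr, support = relative_risk(rows, rule)
                # Keep only rules covering enough observations.
                if support >= min_support and rr > best[0]:
                    best = (rr, rule, support)
    return best


if __name__ == "__main__":
    print(best_rule(RECORDS, ["age", "hb"], min_support=2))
```

Raising `min_support` trades a higher Relative Risk for a rule that covers more observations, which loosely mirrors the abstract's comparison of models "with equal or greater representativity". An enumeration like this grows combinatorially with the number of variables and levels, so it only illustrates the objective, not a practical search strategy.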
