Data mining and machine learning techniques applied to public health problems: A bibliometric analysis from 2009 to 2018

Abstract: This paper presents a bibliometric analysis of the applications of Data Mining (DM) and Machine Learning (ML) techniques in public health from 2009 to 2018. A systematic review of the literature was conducted across three major scientific databases: Scopus, Web of Science and ScienceDirect. The review enabled an analysis of the number of papers by journal, the countries where the applications were carried out, the most commonly used data sources, the most studied public health topics, and the techniques, programming languages and software applications most frequently used by researchers. Our results show a slight increase in the number of papers published in 2014 and a significant increase since 2017, with most papers focusing on infectious, parasitic and communicable diseases, chronic diseases, and risk factors for chronic diseases. The Journal of Medical Internet Research and PLoS ONE published the highest number of papers. Support Vector Machines (SVM) were the most common technique, while R and WEKA were the most common programming language and software application, respectively. Most studies were conducted in the U.S., and Twitter was the data source most frequently used by researchers. This paper thus provides an overview of the literature on DM and ML in public health and serves as a starting point for both novice and experienced researchers interested in the topic.
