Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data

All data are prone to error and require cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values, and many existing methods have limitations that restrict their use across different domains. We designed a decision-making algorithm that modifies or deletes growth measurements based on a combination of pre-defined cut-offs and logic rules. Five data cleaning methods for growth data were tested, with and without the addition of the algorithm, on five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective method in all datasets and had, on average, a minimum of 26.00% higher sensitivity and 0.12% higher specificity than the other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard, that is, returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Cleaning data with non-linear mixed effects models combined with the algorithm allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors and errors in the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them, and removes duplications intelligently.
This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
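
The sensitivity and specificity figures quoted above compare the measurements a cleaning method flags against the measurements known to be erroneous. As a minimal sketch of that comparison, assuming one boolean per measurement for "is a simulated error" and one for "was flagged", the two metrics can be computed as follows. The function name and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    is_error: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of outlier flags against known errors."""
    if len(is_error) != len(flagged):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(is_error, flagged))
    tp = sum(1 for e, f in pairs if e and f)          # errors correctly flagged
    fn = sum(1 for e, f in pairs if e and not f)      # errors missed
    tn = sum(1 for e, f in pairs if not e and not f)  # valid values kept
    fp = sum(1 for e, f in pairs if not e and f)      # valid values wrongly flagged

    # Sensitivity: share of true errors that were flagged.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of valid measurements that were left alone.
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical toy example: 2 simulated errors among 6 measurements.
truth = [False, True, False, False, True, False]
flags = [False, True, False, True, False, False]
sens, spec = sensitivity_specificity(truth, flags)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Because valid measurements usually far outnumber errors in growth data, even a small change in specificity can correspond to many preserved measurements.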
