MIGHT: Statistical Methodology for Missing-Data Imputation in Food Composition Databases

This paper addresses the problem of missing data in food composition databases (FCDBs). Data can be missing for entire foods or only for specific components. Most often, human experts solve the problem by subjectively borrowing data from other FCDBs for estimation or imputation. This approach is not only time-consuming but can also lead to wrong decisions, because the value of a given component in a given food may vary from database to database owing to differences in analytical methods. To ease missing-data borrowing and improve the quality of missing-data selection, we propose a new computer-based methodology, MIGHT (Missing Nutrient Value Imputation UsinG Null Hypothesis Testing), that enables optimal selection of missing data from different FCDBs. An evaluation on a subset of European FCDBs, available through EuroFIR and compliant with the Food data structure and format standard BS EN 16104 (published in 2012), shows that, in more than 80% of the selected cases, MIGHT gives more accurate results than the techniques currently applied for missing-value imputation in FCDBs. MIGHT handles missing data in FCDBs by introducing imputation rules based on the idea that proper statistical analysis can decrease the error of data borrowing.
