Data pre-processing to improve the mining of large feed databases.

The information stored in animal feed databases is highly variable in both provenance and quality, so data pre-processing is essential to ensure reliable results. Yet pre-processing tends to be unsystematic at best and is wholly ignored at worst. This paper sought to develop a systematic approach to the stages involved in pre-processing, with the aim of improving feed database outputs. The database used contained analytical and nutritional data on roughly 20 000 alfalfa samples. A range of techniques was examined for integrating data from different sources, for detecting duplicates and, particularly, for detecting outliers. Special attention was paid to the comparison of univariate and multivariate solutions. Major issues relating to the heterogeneous nature of the data in this database were explored, the observed outliers were characterized, and ad hoc routines were designed for error control. Finally, a heuristic diagram was designed to systematize the detection and management of outliers and errors.
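To illustrate why the comparison of univariate and multivariate outlier detection matters, the following is a minimal sketch and not the routines used in the paper. It assumes two hypothetical, negatively correlated variables (labelled crude protein and NDF, with synthetic values), a z-score cut-off of 3 for the univariate check, and a squared Mahalanobis distance compared against the 99% chi-square quantile with 2 degrees of freedom for the multivariate check. All function names, data and thresholds are assumptions made for illustration.

```python
import math
import statistics


def zscore_outliers(values, threshold=3.0):
    """Indices whose absolute z-score exceeds the threshold (univariate check)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the sample mean."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2x2
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def mahalanobis_outliers(xs, ys, alpha=0.01):
    """Indices whose squared distance exceeds the chi-square(2) quantile at 1 - alpha."""
    cutoff = -2.0 * math.log(alpha)  # closed form of the chi-square quantile for 2 d.f.
    return [i for i, d in enumerate(mahalanobis_sq_2d(xs, ys)) if d > cutoff]


def make_synthetic_samples():
    """Synthetic, hypothetical data: 30 consistent samples plus one inconsistent one."""
    cp = [14 + 0.3 * i for i in range(30)]
    ndf = [60 - 1.5 * c + (0.5 if i % 2 == 0 else -0.5) for i, c in enumerate(cp)]
    # Last sample: each value lies inside its own observed range,
    # but the combination (high protein with high fibre) breaks the trend.
    cp.append(22.0)
    ndf.append(38.0)
    return cp, ndf


cp, ndf = make_synthetic_samples()
univariate_flagged = sorted(set(zscore_outliers(cp)) | set(zscore_outliers(ndf)))
multivariate_flagged = mahalanobis_outliers(cp, ndf)
print("univariate:", univariate_flagged)
print("multivariate:", multivariate_flagged)
```

In this synthetic data the last sample passes both univariate checks because each value is unremarkable on its own, while the multivariate check flags it because the pair of values contradicts the relationship seen in the other samples. Real feed data would call for robust estimates of location and scatter, since the outliers themselves inflate the sample covariance used here.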
