Imputation of Missing Data in Industrial Databases

A limiting factor for the application of intelligent data analysis (IDA) methods in many domains is the incompleteness of data repositories. Many records have fields that are not filled in, especially when data entry is manual. In addition, a significant fraction of the entries can be erroneous, and there may be no alternative but to discard these records. However, the cells of a database are not independent data: statistical relationships constrain, and often determine, missing values. Data imputation, the filling in of missing values in partially complete records, can thus be an invaluable first step in many IDA projects. New imputation methods are needed that can handle the scale and the sparsity of industrial databases. To illustrate the incomplete database problem, we analyze a database of instrumentation maintenance and test records for an industrial process. Despite regulatory requirements for process data collection, this database is less than 50% complete. Next, we discuss possible solutions to the missing data problem. Several approaches to imputation are noted and classified into two categories: data-driven and model-based. We then describe two machine-learning-based approaches that we have worked with, which build upon two well-known algorithms, AutoClass and C4.5. Several experiments are designed, all using the maintenance database as a common test bed but with various data splits and algorithmic variations. Results are generally positive, with imputation accuracies of up to 80%. We conclude the paper by outlining some considerations in selecting imputation methods and by discussing applications of data imputation for intelligent data analysis.
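The idea that statistical relationships between fields constrain a missing value can be illustrated with a minimal sketch. This is not the AutoClass- or C4.5-based method described in the paper; it is a much simpler stand-in that fills a missing categorical field with the most common value among complete records sharing the same value of one other field. The function name, field names, and sample records are all hypothetical.

```python
from collections import Counter, defaultdict


def impute_by_conditional_mode(records, target, predictor):
    """Fill missing `target` values with the most common value observed
    among complete records that share the same `predictor` value.

    Illustrative only: a simple conditional-mode rule, not the paper's method.
    """
    # Count observed target values, overall and per predictor value.
    counts = defaultdict(Counter)
    overall = Counter()
    for rec in records:
        if rec.get(target) is not None:
            overall[rec[target]] += 1
            if rec.get(predictor) is not None:
                counts[rec[predictor]][rec[target]] += 1

    imputed = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if rec.get(target) is None:
            by_group = counts.get(rec.get(predictor))
            # Fall back to the overall mode when the group has no evidence.
            source = by_group if by_group else overall
            if source:
                rec[target] = source.most_common(1)[0][0]
        imputed.append(rec)
    return imputed


# Hypothetical maintenance records; None marks a field that was not filled in.
records = [
    {"instrument": "valve", "test_result": "pass"},
    {"instrument": "valve", "test_result": "pass"},
    {"instrument": "valve", "test_result": "fail"},
    {"instrument": "gauge", "test_result": "fail"},
    {"instrument": "gauge", "test_result": "fail"},
    {"instrument": "valve", "test_result": None},
    {"instrument": "gauge", "test_result": None},
]
filled = impute_by_conditional_mode(records, "test_result", "instrument")
```

A learned model such as a decision tree generalizes this rule by conditioning on several fields at once, which is presumably where approaches built on C4.5 gain their advantage over a single-field lookup.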
