On the Estimation of Missing Data in Incomplete Databases: Autoregressive Bayesian Networks

Missing data can be estimated by interpolation, time series modelling, or exploiting statistically dependent information. The conditions under which one approach is preferable to the alternatives have not been explored, but are likely to be a compromise between the signal's autoregressive information, the availability of future observations, stationary behaviour, and the strength of the dependence on concomitant information. This paper takes a first step towards clarifying the dataset characteristics that delimit the realm of application of each technique. In addition, it introduces autoregressive Bayesian networks (AR-BN), a variant of dynamic Bayesian networks for completing databases, which exploits latent variable relations while still benefitting from the autoregressive information of the variable being filled. With AR-BN, estimated values are computed by inference in the dynamic model. Our results reveal that the interplay between a variable's autoregressive information and its relationship to other variables in the dataset is critical to selecting the optimal data estimation technique. AR-BN appears to be a good candidate, ensuring consistent performance across scenarios, datasets and error metrics.
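The three families of estimators named above can be contrasted on a toy example. The sketch below is not the AR-BN model and is not taken from the paper; it is a minimal illustration, assuming a single gap in a target series `x`, a fully observed companion series `z`, an AR(1) model for the time series approach, and ordinary least squares for the dependence on `z`. All names (`fit_line`, `estimate_gap`, `x`, `z`) are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def estimate_gap(x, z, t):
    """Estimate the missing x[t] (stored as None) in three different ways."""
    # 1) Interpolation: uses the neighbours, so it needs a future observation.
    interpolation = (x[t - 1] + x[t + 1]) / 2.0

    # 2) Time series modelling: AR(1) fitted on consecutive observed pairs,
    #    then a one-step prediction from the last value before the gap.
    pairs = [(a, b) for a, b in zip(x, x[1:]) if a is not None and b is not None]
    c, phi = fit_line([a for a, _ in pairs], [b for _, b in pairs])
    autoregressive = c + phi * x[t - 1]

    # 3) Statistically dependent information: regress x on the concomitant
    #    variable z wherever x is observed, then predict from z[t].
    observed = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]
    a, b = fit_line([zi for zi, _ in observed], [xi for _, xi in observed])
    dependent = a + b * z[t]

    return {
        "interpolation": interpolation,
        "autoregressive": autoregressive,
        "dependent": dependent,
    }


# Toy data (assumed): x is linear in time with one gap, z is exactly 2 * x.
x = [1.0, 2.0, 3.0, None, 5.0, 6.0]
z = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
estimates = estimate_gap(x, z, t=3)
print(estimates)
```

On this idealised data all three estimates coincide. On real data they diverge depending on how strong the autoregressive structure is, whether future observations exist, and how tightly `x` depends on `z`, which is the trade-off the abstract describes.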
