General Table Completion using a Bayesian Nonparametric Model

Even though heterogeneous databases can be found in a broad variety of applications, there exists a lack of tools for estimating missing data in such databases. In this paper, we provide an efficient and robust table completion tool, based on a Bayesian nonparametric latent feature model. In particular, we propose a general observation model for the Indian buffet process (IBP) adapted to mixed continuous (real-valued and positive real-valued) and discrete (categorical, ordinal and count) observations. Then, we propose an inference algorithm that scales linearly with the number of observations. Finally, our experiments over five real databases show that the proposed approach provides more robust and accurate estimates than the standard IBP and the Bayesian probabilistic matrix factorization with Gaussian observations.

[1]  Chong Wang,et al.  Collaborative topic modeling for recommending scientific articles , 2011, KDD.

[2]  Thomas L. Griffiths,et al.  The Indian Buffet Process: An Introduction and Review , 2011, J. Mach. Learn. Res..

[3]  Thore Graepel,et al.  Automated Probabilistic Modelling for Relational Data , 2013 .

[4]  Wei Chu,et al.  Gaussian Processes for Ordinal Regression , 2005, J. Mach. Learn. Res..

[5]  Zoubin Ghahramani,et al.  Accelerated sampling for the Indian Buffet Process , 2009, ICML '09.

[6]  Michalis K. Titsias,et al.  The Infinite Gamma-Poisson Feature Model , 2007, NIPS.

[7]  Walter A. Kosters,et al.  Genetic Programming for data classification: partitioning the search space , 2004, SAC '04.

[8]  Ruslan Salakhutdinov,et al.  Bayesian probabilistic matrix factorization using Markov chain Monte Carlo , 2008, ICML '08.

[9]  Roberto Todeschini,et al.  Quantitative Structure − Activity Relationship Models for Ready Biodegradability of Chemicals , 2013 .

[10]  Mark Girolami,et al.  Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors , 2006, Neural Computation.

[11]  Lawrence Carin,et al.  Inferring Latent Structure From Mixed Real and Categorical Relational Data , 2012, ICML.

[12]  Fernando Pérez-Cruz,et al.  Bayesian nonparametric comorbidity analysis of psychiatric disorders , 2014, J. Mach. Learn. Res..

[13]  Marie Chavent,et al.  Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms , 2013, NIPS.

[14]  Xiao-Bai Li A Bayesian Approach for Estimating and Replacing Missing Categorical Data , 2009, JDIQ.

[15]  Fernando Pérez-Cruz,et al.  Bayesian Nonparametric Modeling of Suicide Attempts , 2012, NIPS.

[16]  Joshua B. Tenenbaum,et al.  A probabilistic model of cross-categorization , 2011, Cognition.

[17]  Ruslan Salakhutdinov,et al.  Probabilistic Matrix Factorization , 2007, NIPS.

[18]  Paulo Cortez,et al.  Modeling wine preferences by data mining from physicochemical properties , 2009, Decis. Support Syst..

[19]  Chong Wang,et al.  The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling , 2010, ICML.

[20]  David M. Blei,et al.  Bayesian Nonparametric Poisson Factorization for Recommendation Systems , 2014, AISTATS.

[21]  C. Robert Simulation of truncated normal variables , 2009, 0907.4010.