Semi-supervised learning for structured regression on partially observed attributed graphs

Conditional probabilistic graphical models provide a powerful framework for structured regression in spatio-temporal datasets with complex correlation patterns. However, in real-life applications a large fraction of observations is often missing, which can severely limit the representational power of these models. In this paper we propose a Marginalized Gaussian Conditional Random Fields (m-GCRF) structured regression model for dealing with missing labels in partially observed temporal attributed graphs. The method learns from both the labeled and unlabeled parts of a graph and predicts its future values. It can even learn from nodes whose response variable has never been observed, a setting that poses problems for many state-of-the-art models that can handle missing data. The proposed model is characterized under various missingness mechanisms on 500 synthetic graphs. The benefits of the new method are also demonstrated on a challenging application: predicting precipitation from partial observations of climate variables in a temporal graph that spans the entire continental US. We also show that the method can help optimize the cost of data collection in climate applications by actively reducing the number of weather stations considered. In experiments on these real-world and synthetic datasets, the proposed model is consistently more accurate than alternative semi-supervised structured models, as well as models that either use imputation to deal with missing values or ignore them altogether.
