Discriminative, Generative and Self-Supervised Approaches for Target-Agnostic Learning

Supervised learning, encompassing both discriminative and generative learning, seeks to predict the values of one (or sometimes several) predefined target attributes from a predefined set of predictor attributes. For applications where the information available and the predictions to be made vary from instance to instance, we propose the task of target-agnostic learning, in which arbitrary disjoint sets of attributes can serve as predictors and targets for each instance to be predicted. For this task, we survey a wide range of techniques for handling missing values, self-supervised training and pseudo-likelihood training, and adapt them into a suite of algorithms suitable for the task. We conduct extensive experiments with this suite of algorithms on a large collection of categorical, continuous and discretized datasets, and report their performance in terms of both classification and regression errors. We also report the training and prediction times of these algorithms when handling large-scale datasets. Both generative and self-supervised learning models are shown to perform well at the task, although their behavior differs markedly across the different types of data. Nevertheless, a theorem we derive from pseudo-likelihood theory shows that the two approaches are related when inferring a joint distribution model through pseudo-likelihood training.
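To make the claimed connection concrete: pseudo-likelihood training (in Besag's classical sense; the paper's specific theorem is not reproduced here) replaces the intractable joint log-likelihood with a sum of univariate conditionals,

$$\ell_{\mathrm{PL}}(\theta)=\sum_{i=1}^{n}\sum_{j=1}^{d}\log p_\theta\!\left(x_j^{(i)}\mid x_{-j}^{(i)}\right),$$

where $x_{-j}^{(i)}$ denotes all attributes of instance $i$ other than the $j$-th. Each inner term is a one-attribute-masked prediction problem, which is why a self-supervised model trained to reconstruct randomly masked attributes also behaves like a pseudo-likelihood-trained joint model, and can therefore answer queries with arbitrary predictor/target splits.

The sketch below illustrates that training scheme. It is a minimal illustration only, assuming PyTorch, continuous attributes and a squared-error loss; the class and function names are hypothetical and do not come from the paper.

```python
# Minimal masked-attribute trainer: random predictor/target splits per
# instance make a single network usable for target-agnostic prediction.
import torch
import torch.nn as nn

class MaskedTabularNet(nn.Module):
    def __init__(self, d: int, hidden: int = 64):
        super().__init__()
        # Input: observed values (masked entries zeroed) concatenated
        # with the mask itself, so the network knows what is missing.
        self.net = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, d)
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mask[i, j] == 1 means attribute j is observed (a predictor).
        return self.net(torch.cat([x * mask, mask], dim=-1))

def train_step(model, opt, x):
    # Sample an arbitrary predictor/target split for each instance.
    mask = (torch.rand_like(x) < 0.7).float()
    pred = model(x, mask)
    # Score only the hidden (target) attributes, mirroring the
    # conditional terms of the pseudo-likelihood objective.
    denom = (1 - mask).sum().clamp(min=1)
    loss = (((pred - x) ** 2) * (1 - mask)).sum() / denom
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

d = 8  # number of attributes (toy value)
model = MaskedTabularNet(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    train_step(model, opt, torch.randn(32, d))  # synthetic data
```

At prediction time the same forward pass applies: set the mask to the attributes actually observed for a given instance, and read predictions off the remaining coordinates.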
