Getting the Most out of your Data: Multitask Bayesian Network Structure Learning, Predicting Good Probabilities and Ensemble Selection

First, I consider the problem of simultaneously learning the structures of multiple Bayesian networks from multiple related datasets. I present a multitask Bayes net structure learning algorithm that learns more accurate network structures by transferring useful information between the datasets. The algorithm extends the score-and-search techniques used in traditional structure learning to the multitask case by defining a scoring function for sets of structures (one structure per task) and an efficient procedure for searching for a high-scoring set of structures. I also address the task selection problem in the context of multitask Bayes net structure learning. Unlike in other multitask learning scenarios, the Bayes net structure learning setting admits a clear definition of task relatedness: two tasks are related if they have similar structures. This makes it possible to automatically select a set of related tasks for multitask structure learning.

Second, I examine the relationship between the predictions made by different supervised learning algorithms and true posterior probabilities. I show that quasi-maximum-margin methods such as boosted decision trees and SVMs push probability mass away from 0 and 1, yielding a characteristic sigmoid-shaped distortion in the predicted probabilities, while Naive Bayes pushes probabilities toward 0 and 1. Other models, such as neural nets, logistic regression, and bagged trees, usually do not have these biases and predict well-calibrated probabilities. I experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. I qualitatively examine which distortions each calibration method is suited to correct and quantitatively examine how much data each needs to be effective.

Third, I present a method for constructing ensembles from libraries of thousands of models. Model libraries are generated using different learning algorithms and parameter settings, and forward stepwise selection is used to add to the ensemble the models that maximize its performance. The main drawback of ensemble selection is that it builds ensembles that are very large and slow at test time. This drawback, however, can be overcome with little or no loss in performance by using model compression.
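As an illustration of the calibration step discussed in the second part, the sketch below fits both Platt Scaling (a one-dimensional logistic sigmoid) and Isotonic Regression (a monotone, piecewise-constant mapping) to the raw scores of a boosted-tree model on a held-out calibration set. This is a minimal sketch under stated assumptions, not the thesis code: scikit-learn, a synthetic dataset, and GradientBoostingClassifier as the example of a model with the sigmoid-shaped distortion are all choices made here for illustration.

```python
# Illustrative sketch (not the thesis code): correcting distorted probabilities
# with Platt Scaling and Isotonic Regression, assuming scikit-learn and a
# synthetic binary classification problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

# Boosted trees: raw scores tend to show the sigmoid-shaped distortion described above.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
raw_scores = model.decision_function(X_cal)

# Platt Scaling: fit a one-dimensional logistic regression mapping scores to probabilities.
platt = LogisticRegression().fit(raw_scores.reshape(-1, 1), y_cal)
platt_probs = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

# Isotonic Regression: fit a monotone step function from scores to probabilities.
iso = IsotonicRegression(out_of_bounds="clip").fit(raw_scores, y_cal)
iso_probs = iso.predict(raw_scores)

# In practice the calibration map is fit on held-out data (as here) and then
# evaluated on a separate test set; that evaluation is omitted for brevity.
```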

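The forward stepwise selection used in the third part to build ensembles from a model library can likewise be sketched in a few lines. The sketch below is illustrative rather than the thesis implementation: the `ensemble_selection` function name, the choice of AUC as the validation metric, and the fixed number of selection steps are assumptions made for brevity, and refinements such as initializing with the best models, bagged selection, and keeping the best ensemble seen during the search are omitted.

```python
# Illustrative sketch (not the thesis implementation): greedy forward stepwise
# ensemble selection, with replacement, from a library of fitted models. At each
# step the model whose addition most improves a validation metric is added.
import numpy as np
from sklearn.metrics import roc_auc_score  # AUC is an assumed metric choice

def ensemble_selection(library_preds, y_val, n_steps=50):
    """library_preds: dict mapping model name -> validation-set probability
    predictions (1-D numpy arrays); y_val: 1-D numpy array of 0/1 labels."""
    selected = []                                   # names chosen so far (with repetition)
    ensemble_sum = np.zeros_like(y_val, dtype=float)
    for _ in range(n_steps):
        best_name, best_score = None, -np.inf
        for name, preds in library_preds.items():
            # Average of the selected models' predictions plus the candidate's.
            candidate = (ensemble_sum + preds) / (len(selected) + 1)
            score = roc_auc_score(y_val, candidate)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        ensemble_sum += library_preds[best_name]
    return selected
```

The resulting ensemble averages the predictions of the selected models; as noted above, it can then be replaced by a single compressed model to recover test-time speed.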