In recent years, statisticians, computational learning theorists, and engineers have developed more advanced techniques for learning complex non-linear relationships from data. Not only have models increased in complexity, but datasets have also outgrown many of the model-fitting methods of standard statistical practice. The first several sections of this dissertation show how boosting, a technique originating in computational learning theory, applies widely to learning non-linear relationships even when datasets are potentially massive. I describe particular applications of boosting to naive Bayes classification and regression, and to exponential family and proportional hazards regression models. I also show how these methods can easily incorporate desirable features, including robust regression, variance reduction methods, and interpretability. On both real and simulated datasets and in a variety of modeling frameworks, boosting consistently outperforms standard methods in terms of error on validation datasets.
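As a concrete illustration of the general boosting idea (a minimal least-squares sketch, not the particular algorithms developed in the chapters that follow), the snippet below builds a non-linear fit by repeatedly adding small decision stumps fitted to the current residuals. The stump learner, shrinkage value, iteration count, and simulated data are illustrative assumptions, not values from the text.

```python
import numpy as np

def fit_stump(x, r):
    """Find the single split of x that best predicts residuals r in least squares."""
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, split, lval, rval = best
    return lambda z: np.where(z <= split, lval, rval)

def boost(x, y, n_iter=200, shrinkage=0.1):
    """Least-squares boosting with stumps: each step fits the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_iter):
        r = y - pred                      # residual = negative gradient of squared error
        stump = fit_stump(x, r)
        pred += shrinkage * stump(x)
        stumps.append(stump)
    return y.mean(), stumps, shrinkage

# Illustrative usage on simulated data with a non-linear signal
rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 500)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=500)
intercept, stumps, lr = boost(x, y)
fitted = intercept + lr * sum(s(x) for s in stumps)
print("training MSE:", np.mean((y - fitted) ** 2))
```

The shrinkage parameter slows the fit down so that many small stumps accumulate into a smooth non-linear function, which is the mechanism that later chapters adapt to other loss functions and model classes.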
In separate but related work, the last chapter presents ideas for applying Bayesian methods to inference on massive datasets. Modern Bayesian analysis relies on Monte Carlo methods for sampling from complex posterior distributions, such as those arising from Bayesian hierarchical models. These methods slow down dramatically when the posterior distribution conditions on a large dataset that cannot be summarized by a small number of sufficient statistics. I develop an adaptive importance sampling algorithm that efficiently simulates draws from a posterior distribution conditioned on a massive dataset. I also propose a method for approximate Bayesian inference that uses likelihood clustering for data reduction.
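To make the importance sampling step concrete, the sketch below implements the basic self-normalized estimator on a simulated large dataset. It is not the adaptive algorithm of the chapter (which would refine the proposal from the weighted draws); the normal model, prior, proposal, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "massive" dataset from a normal model with unknown mean theta
data = rng.normal(loc=2.0, scale=1.0, size=100_000)

def log_posterior(theta):
    """Unnormalized log posterior for each theta: N(0, 10^2) prior plus the
    full-data normal log likelihood (additive constants dropped; they cancel
    in the self-normalized weights)."""
    log_prior = -0.5 * (theta / 10.0) ** 2
    # Each evaluation touches the entire dataset; this is the expensive step
    # an importance sampler tries to call as few times as possible.
    log_lik = np.array([-0.5 * np.sum((data - t) ** 2) for t in theta])
    return log_prior + log_lik

# Normal proposal; an adaptive scheme would update its location and scale
# from the weighted draws, here both are simply fixed.
prop_loc, prop_scale = data.mean(), 0.005
theta = rng.normal(prop_loc, prop_scale, size=1_000)
log_q = -0.5 * ((theta - prop_loc) / prop_scale) ** 2   # constants again dropped

# Self-normalized importance weights
log_w = log_posterior(theta) - log_q
w = np.exp(log_w - log_w.max())          # subtract the max for numerical stability
w /= w.sum()

print("posterior mean estimate:", np.sum(w * theta))
print("effective sample size:", 1.0 / np.sum(w ** 2))
```

The effective sample size reported at the end is the usual diagnostic for how well the proposal matches the posterior; an adaptive scheme aims to keep it high while evaluating the full-data likelihood as rarely as possible.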