AN ABSTRACT OF THE DISSERTATION OF

Guohua Hao for the degree of Doctor of Philosophy in Computer Science, presented on July 21, 2009.

Title: Efficient Training and Feature Induction in Sequential Supervised Learning

Abstract approved: Thomas G. Dietterich

Sequential supervised learning problems arise in many real applications. This dissertation focuses on two important research directions in sequential supervised learning: efficient training and feature induction.

In the direction of efficient training, we study the training of conditional random fields (CRFs), which provide a flexible and powerful model for sequential supervised learning problems. Existing training algorithms for CRFs are slow, particularly on problems with large numbers of potential input features and feature combinations. This dissertation describes a new algorithm, TREECRF, which trains CRFs via gradient tree boosting. In TREECRF, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions, so the algorithm never needs to enumerate the potentially large parameter space explicitly. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Detailed experiments evaluate the performance of TREECRF, and possible extensions of the algorithm are discussed.

We also study the problem of handling missing input values in CRFs, which has rarely been discussed in the literature. Gradient tree boosting makes it possible to handle missing values in CRFs with instance weighting (as in C4.5) and surrogate splitting (as in CART). Experimental comparisons of these two methods, along with standard imputation and indicator-feature methods, show that instance weighting is the best method in most cases when feature values are missing at random.

In the direction of feature induction, we study the search-based structured learning framework and its application to sequential supervised learning problems. By formulating label sequence prediction as an incremental search from one end of a sequence to the other, this framework avoids complicated inference algorithms during training and thus trains very quickly. However, when long-range dependencies exist between the current position and future positions, the framework cannot exploit them at each search step to make accurate predictions. This dissertation proposes a multiple-instance learning based algorithm that automatically extracts useful features from future positions in order to discover and exploit these long-range dependencies. Integrating this algorithm with maximum entropy Markov models yields promising experimental results on both synthetic and real data sets containing long-range dependencies.
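
To make the idea of potential functions represented as weighted sums of regression trees concrete, the following is a minimal sketch of gradient tree boosting, not the dissertation's actual implementation. For clarity it drops the Markov (transition) part, so it reduces to boosted multinomial logistic regression over individual positions; the full TREECRF computes the functional gradients with forward-backward over each sequence. The class name `BoostedPotential` and all parameter choices here are illustrative assumptions.

```python
# Sketch: per-label potential F_k(x) stored as a sum of small regression trees,
# grown by fitting each new tree to the functional gradient of the log-likelihood.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class BoostedPotential:
    def __init__(self, n_labels, n_rounds=50, max_depth=3, shrinkage=0.1):
        self.n_labels = n_labels
        self.n_rounds = n_rounds
        self.max_depth = max_depth
        self.shrinkage = shrinkage
        self.trees = [[] for _ in range(n_labels)]  # trees[k]: trees for label k

    def _scores(self, X):
        # F_k(x) for every example and label: shrunken sum of tree outputs.
        F = np.zeros((X.shape[0], self.n_labels))
        for k in range(self.n_labels):
            for tree in self.trees[k]:
                F[:, k] += self.shrinkage * tree.predict(X)
        return F

    def fit(self, X, y):
        for _ in range(self.n_rounds):
            F = self._scores(X)
            P = np.exp(F - F.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)
            for k in range(self.n_labels):
                # Functional gradient w.r.t. F_k: indicator(y == k) - P(y = k | x).
                residual = (y == k).astype(float) - P[:, k]
                tree = DecisionTreeRegressor(max_depth=self.max_depth)
                tree.fit(X, residual)
                self.trees[k].append(tree)
        return self

    def predict(self, X):
        return self._scores(X).argmax(axis=1)
```

Because each boosting round only grows trees over the observed features, the model never materializes an explicit weight for every feature combination, which is the source of the scaling advantage described above.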
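
The instance-weighting treatment of missing values can likewise be illustrated with a small sketch, assuming a hypothetical dict-based tree-node layout (`feature`, `threshold`, `left_fraction`, `left`, `right`, leaf `value`); this shows the C4.5-style mechanism only, not the dissertation's code.

```python
# Sketch: C4.5-style instance weighting at tree splits. An example whose split
# feature is missing is sent down both branches, with its weight divided in
# proportion to the training weight that followed each branch.
def route(node, example, weight=1.0):
    """Return [(leaf_value, weight), ...] reached by one example."""
    if 'value' in node:                      # leaf node
        return [(node['value'], weight)]
    x = example.get(node['feature'])         # None means the value is missing
    if x is None:
        p_left = node['left_fraction']       # fraction of training weight that went left
        return (route(node['left'], example, weight * p_left) +
                route(node['right'], example, weight * (1.0 - p_left)))
    branch = node['left'] if x <= node['threshold'] else node['right']
    return route(branch, example, weight)


def predict(node, example):
    """Weighted average of all leaves the example reaches."""
    leaves = route(node, example)
    total = sum(w for _, w in leaves)
    return sum(v * w for v, w in leaves) / total
```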
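
Finally, the incremental search formulation of label sequence prediction can be sketched as a greedy left-to-right decoder. The helper `featurize` and the per-position classifier `clf` (any probabilistic classifier with a `predict_proba` method, e.g. logistic regression) are assumptions for illustration; actual search-based learners often use beam search rather than a purely greedy pass. The sketch makes visible why long-range dependencies are a problem: at step t the features can only look at the current and previous positions.

```python
# Sketch: left-to-right incremental decoding as used in search-based
# structured learning / MEMM-style models.
import numpy as np


def featurize(x_seq, t, prev_label, n_labels):
    """Features of position t: the observation plus a one-hot encoding of
    the previously predicted label (no access to future positions)."""
    prev = np.zeros(n_labels)
    if prev_label is not None:
        prev[prev_label] = 1.0
    return np.concatenate([x_seq[t], prev])


def greedy_decode(x_seq, clf, n_labels):
    """Label the sequence one position at a time, committing to the most
    probable label at each step."""
    labels = []
    prev = None
    for t in range(len(x_seq)):
        feats = featurize(x_seq, t, prev, n_labels).reshape(1, -1)
        probs = clf.predict_proba(feats)[0]
        prev = int(np.argmax(probs))
        labels.append(prev)
    return labels
```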
