Algorithms and Analysis for Multi-Category Classification

Classification problems in machine learning involve assigning labels to various kinds of outputs, from single-assignment binary and multi-class classification to more complex assignments such as category ranking, sequence identification, and structured-output classification. Traditionally, most machine learning algorithms and theory have been developed for the binary setting. In this dissertation, we provide a framework that unifies these problems; through it, many algorithms and much of the theoretical understanding developed in the binary domain extend to more complex settings.

First, we introduce constraint classification, a learning framework that provides a unified view of complex-output problems. Within this framework, each complex-output label is viewed as a set of constraints sufficient to capture the information needed to classify the example. Prediction in the complex-output setting thus reduces to determining which constraints, out of a potentially large set, hold for a given example, a task that can be accomplished by repeatedly applying a single binary classifier to indicate whether each constraint holds. Using this insight, we provide a principled extension of binary learning algorithms, such as the support vector machine and the Perceptron algorithm, to the complex-output domain, and we show that the desirable theoretical and experimental properties of these algorithms are maintained in the new setting.

Second, we address the structured-output problem directly. Structured-output labels are collections of variables corresponding to a known structure, such as a tree, graph, or sequence, that can bias or constrain the global output assignment. The traditional approach to learning structured-output classifiers, which decomposes a structured output into multiple localized labels that are learned independently, is theoretically sub-optimal. In contrast, recent methods, such as constraint classification, that learn functions to directly classify the global output can achieve optimal performance. Surprisingly, in practice it is unclear which family of methods achieves state-of-the-art performance, so we study under what circumstances each method performs best. With enough time, training data, and representational power, the global approaches are better. However, we also show, both theoretically and experimentally, that learning a suite of local classifiers, even sub-optimal ones, can produce the best results in many real-world settings.

Third, we address an important algorithm in machine learning, the maximum margin classifier. Even with a conceptual understanding of how to extend maximum margin algorithms beyond the binary case, and with performance guarantees for large margin classifiers, complex outputs render the traditional approaches intractable. We introduce a new algorithm for learning maximum margin classifiers that uses coresets to find a provably approximate solution to the maximum margin separating hyperplane. Through the constraint classification framework, this algorithm then applies directly to all of the complex-output domains mentioned above. In addition, coresets motivate approximate algorithms for active learning and for learning in the presence of outlier noise, where we give simple, elegant, and previously unknown proofs of their effectiveness.
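To make the constraint-classification reduction concrete, the sketch below implements a multiclass Perceptron in this view: a label y imposes the pairwise constraints w_y · x > w_k · x for every competing class k, and each violated constraint triggers an ordinary binary-style Perceptron update. This is a minimal illustration of the idea under our own choice of names and update schedule, not the dissertation's full construction.

```python
import numpy as np

def constraint_perceptron(X, y, num_classes, epochs=10):
    """Multiclass Perceptron in the constraint-classification view:
    each label imposes pairwise constraints, and a violated constraint
    triggers a binary-style promote/demote update."""
    W = np.zeros((num_classes, X.shape[1]))   # one weight vector per class
    for _ in range(epochs):
        for x, yi in zip(X, y):
            scores = W @ x
            for k in range(num_classes):
                # The label yi imposes the constraint w_yi . x > w_k . x.
                if k != yi and scores[k] >= scores[yi]:
                    W[yi] += x                # promote the correct class
                    W[k] -= x                 # demote the violating class
                    scores = W @ x            # refresh after the update
    return W

def predict(W, x):
    # The predicted label is the one whose weight vector satisfies all
    # of its pairwise constraints, i.e. the argmax score.
    return int(np.argmax(W @ x))
```

Prediction here only ever compares linear scores, which is why a single binary-style decision ("does this constraint hold?") suffices as the underlying primitive.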
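For the second contribution, the local-versus-global contrast can be illustrated with sequence labeling. The sketch below is one plausible instantiation of the global approach, a structured-Perceptron-style learner that decodes the whole sequence with Viterbi before updating; the abstract does not name the exact learners studied, so the feature layout and update rule here are assumptions. The local alternative would simply train an independent per-position classifier and ignore the transition scores entirely.

```python
import numpy as np

def viterbi(emit, trans):
    """Best label sequence under per-position emission scores (T x K)
    plus first-order transition scores (K x K): the sequence structure."""
    T, K = emit.shape
    dp = np.empty((T, K)); back = np.zeros((T, K), dtype=int)
    dp[0] = emit[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + trans     # cand[i, j]: prev i -> cur j
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + emit[t]
    seq = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1]

def global_update(W, trans, x_seq, y_seq):
    """One structured-Perceptron step: decode the full sequence, then
    push weights toward the gold output and away from the prediction."""
    pred = viterbi(x_seq @ W.T, trans)
    for t, (p, g) in enumerate(zip(pred, y_seq)):
        if p != g:                            # emission features disagree
            W[g] += x_seq[t]; W[p] -= x_seq[t]
        if t > 0:                             # transition features
            trans[y_seq[t - 1], g] += 1.0     # promote the gold transition
            trans[pred[t - 1], p] -= 1.0      # demote the predicted one
    return W, trans
```

The global learner pays for its optimality in decoding cost and sample complexity, which is consistent with the observation above that a suite of local classifiers can win in practice when data or time is limited.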
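Finally, the coreset idea for the third contribution can be sketched as follows: train on a small working set, find the point the current hyperplane classifies with the worst margin, add it, and retrain. This is a minimal sketch assuming binary labels in {-1, +1} and using scikit-learn's SVC with a large C as a stand-in hard-margin solver; the dissertation's own coreset construction, stopping rule, and approximation guarantee are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

def coreset_max_margin(X, y, tol=1e-3, max_iters=100):
    """Approximate the maximum margin hyperplane by training only on a
    small, greedily grown coreset. Assumes labels y are in {-1, +1}."""
    coreset = [int(np.where(y == c)[0][0]) for c in (-1, 1)]  # one seed per class
    clf = None
    for _ in range(max_iters):
        clf = SVC(kernel="linear", C=1e6)     # large C approximates hard margin
        clf.fit(X[coreset], y[coreset])
        w_norm = np.linalg.norm(clf.coef_)
        margins = y * clf.decision_function(X) / w_norm   # geometric margins
        worst = int(np.argmin(margins))
        # Stop when no point violates the coreset's margin by more than tol.
        if worst in coreset or margins[worst] >= margins[coreset].min() - tol:
            break
        coreset.append(worst)                 # add the worst violator, retrain
    return clf, coreset
```

Because each retraining touches only the coreset, the expensive optimization runs over a set whose size depends on the margin rather than on the number of examples, which is what makes the approach attractive for the large constraint sets produced by complex-output reductions.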
