Semi-Supervised Learning via Generalized Maximum Entropy

The maximum entropy (MaxEnt) framework has been studied extensively in the supervised setting. Here, the goal is to find a distribution p that maximizes an entropy function while enforcing data constraints, so that the expected values of some pre-defined features with respect to p approximately match their empirical counterparts. Using different entropy measures, different model spaces for p, and different approximation criteria for the data constraints yields a family of discriminative supervised learning methods (e.g., logistic regression, conditional random fields, least squares and boosting) (Dudik & Schapire, 2006; Friedlander & Gupta, 2006; Altun & Smola, 2006). This framework is known as the generalized maximum entropy framework.

Semi-supervised learning (SSL) is a promising field that has attracted increasing attention in the last decade. SSL algorithms use unlabeled data along with labeled data to increase the accuracy and robustness of inference algorithms. However, most SSL algorithms to date involve trade-offs, e.g., in terms of scalability or applicability to multi-categorical data. In this thesis, we extend the generalized MaxEnt framework to develop a family of novel SSL algorithms using two different approaches:

(1) Introducing Similarity Constraints. We incorporate unlabeled data by modifying the primal MaxEnt objective with additional potential functions. A potential function is a closed proper convex function that can take the form of a constraint and/or a penalty representing our structural assumptions on the data geometry. Specifically, we impose similarity constraints as additional penalties based on the semi-supervised smoothness assumption, i.e., we restrict the MaxEnt problem such that similar samples have similar model outputs. The motivation is reminiscent of that of Laplacian SVM (Sindhwani et al., 2005) and manifold transductive neural networks (Karlen et al., 2008). However, instead of regularizing the loss function in the dual, we integrate our constraints directly into the primal MaxEnt problem, which has a more straightforward and natural interpretation.

(2) Augmenting Constraints on Model Features. We incorporate unlabeled data to enhance the moment matching constraints of the generalized MaxEnt problem in the primal. We improve the estimates of the model and empirical expectations of the feature functions using our assumptions on the data geometry.

In particular, we derive the semi-supervised formulations for three specific instances of the generalized MaxEnt framework on conditional distributions, namely logistic regression and kernel logistic regression for multi-class problems, and conditional random fields for structured output prediction problems. A thorough empirical evaluation on standard data sets that are widely used in the literature demonstrates the validity and competitiveness of the proposed algorithms. In addition to these benchmark data sets, we apply our approach to two real-life problems, vision-based robot grasping and remote sensing image classification, where the scarcity of labeled training samples is the main bottleneck in the learning process. For the particular case of grasp learning, we also propose a combination of semi-supervised learning and active learning, another sub-field of machine learning that addresses the scarcity of labeled samples, for problem setups that are suitable for incremental labeling.

To conclude, the novel SSL algorithms proposed in this thesis have numerous advantages over existing semi-supervised algorithms, as they yield convex, scalable, inherently multi-class loss functions that can be kernelized naturally.
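To make the smoothness idea concrete, the following is a minimal illustrative sketch and not the derivation used in the thesis. It assumes a linear multi-class logistic regression model (the dual of a MaxEnt problem with relaxed moment constraints) and swaps in a graph-Laplacian penalty on the linear scores as a stand-in for the similarity penalties described above. The function names, the RBF similarity, and the hyperparameters `lam`, `mu`, `lr`, and `n_iter` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax with the usual max-shift for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def rbf_similarity(X, gamma=1.0):
    # Pairwise similarities s_ij = exp(-gamma * ||x_i - x_j||^2) (assumed choice).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def fit_ssl_logreg(X_lab, y_lab, X_unl, n_classes,
                   lam=1e-2, mu=1e-2, lr=0.1, n_iter=500):
    # Labeled and unlabeled samples both enter the similarity graph.
    X_all = np.vstack([X_lab, X_unl])
    S = rbf_similarity(X_all)
    L = np.diag(S.sum(axis=1)) - S              # graph Laplacian
    M = X_all.T @ L @ X_all / len(X_all) ** 2   # smoothness penalty is tr(W^T M W)
    Y = np.eye(n_classes)[y_lab]                # one-hot labels
    W = np.zeros((X_lab.shape[1], n_classes))
    for _ in range(n_iter):
        P = softmax(X_lab @ W)
        # Model feature expectations minus empirical ones (moment-matching gap).
        grad = X_lab.T @ (P - Y) / len(X_lab)
        grad += lam * W                         # L2 term from relaxing the constraints
        grad += 2.0 * mu * (M @ W)              # similar samples get similar scores
        W -= lr * grad
    return W

def predict(W, X):
    # Most probable class under the fitted conditional model.
    return softmax(X @ W).argmax(axis=1)
```

With `mu=0` this reduces to ordinary L2-regularized logistic regression on the labeled samples alone; the unlabeled samples influence the solution only through the matrix `M`. The objective stays convex in `W` because both added terms are positive semidefinite quadratic forms.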

[1]  E. Jaynes  Information Theory and Statistical Mechanics , 1957 .

[2]  Toru Maruyama (丸山 徹)  On a Few Developments in Convex Analysis (in Japanese) , 1977 .

[3]  T. M. Lillesand,et al.  Remote Sensing and Image Interpretation , 1980 .

[4]  Matthew Thomas Mason,et al.  Manipulator grasping and pushing operations , 1982 .

[5]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[6]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[7]  J. Q. Smith,et al.  1. Bayesian Statistics 4 , 1993 .

[8]  David D. Lewis,et al.  Heterogeneous Uncertainty Sampling for Supervised Learning , 1994, ICML.

[9]  Ronald Rosenfeld,et al.  A maximum entropy approach to adaptive statistical language modelling , 1996, Comput. Speech Lang..

[10]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[11]  John D. Lafferty,et al.  Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[13]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[14]  Ayhan Demiriz,et al.  Semi-Supervised Support Vector Machines , 1998, NIPS.

[15]  Thorsten Joachims,et al.  Making large-scale support vector machine learning practical , 1999 .

[16]  Yoram Singer,et al.  Boosting Applied to Tagging and PP Attachment , 1999, EMNLP.

[17]  Ronald Rosenfeld,et al.  A survey of smoothing techniques for ME models , 2000, IEEE Trans. Speech Audio Process..

[18]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[19]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[20]  J. Borwein,et al.  Convex Analysis And Nonlinear Optimization , 2000 .

[21]  Vijay Kumar,et al.  Robotic grasping and contact: a review , 2000, Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065).

[22]  Eiji Watanabe,et al.  A Distributed-Cooperative Learning Algorithm for Multi-Layered Neural Networks using a PC Cluster , 2001 .

[23]  Tommi S. Jaakkola,et al.  Partially labeled classification with Markov random walks , 2001, NIPS.

[24]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[25]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[26]  Zoubin Ghahramani,et al.  Learning from labeled and unlabeled data with label propagation , 2002 .

[27]  Bernhard Schölkopf,et al.  Cluster Kernels for Semi-Supervised Learning , 2002, NIPS.

[28]  Tommi S. Jaakkola,et al.  Information Regularization with Partially Labeled Data , 2002, NIPS.

[29]  Bernhard Schölkopf,et al.  Learning to Find Pre-Images , 2003, NIPS.

[30]  J. Lafferty,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[31]  Anthony Widjaja,et al.  Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond , 2003, IEEE Transactions on Neural Networks.

[32]  M. Todd  Convex Analysis and Nonlinear Optimization: Theory and Examples. Jonathan M. Borwein and Adrian S. Lewis, Springer, New York, 2000 , 2003 .

[33]  Markus Lappe,et al.  Biologically Motivated Multi-modal Processing of Visual Primitives , 2003 .

[34]  Xiaojin Zhu,et al.  Kernel conditional random fields: representation and clique selection , 2004, ICML.

[35]  Michael I. Jordan,et al.  Multiple kernel learning, conic duality, and the SMO algorithm , 2004, ICML.

[36]  Joshua Goodman,et al.  Exponential Priors for Maximum Entropy Models , 2004, NAACL.

[37]  Ruzena Bajcsy,et al.  Active learning for vision-based robot grasping , 1996, Machine Learning.

[38]  James J. Kuffner,et al.  Effective sampling and distance metrics for 3D rigid body path planning , 2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004.

[39]  Thomas Hofmann,et al.  Hierarchical document categorization with support vector machines , 2004, CIKM '04.

[40]  Antonio Torralba,et al.  Contextual Models for Object Detection Using Boosted Random Fields , 2004, NIPS.

[41]  Yoshua Bengio,et al.  Semi-supervised Learning by Entropy Minimization , 2004, CAP.

[42]  Antonio Morales,et al.  An active learning approach for assessing robot grasp reliability , 2004, 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566).

[43]  Yoram Singer,et al.  Logistic Regression, AdaBoost and Bregman Distances , 2000, Machine Learning.

[44]  Miroslav Dudík,et al.  Performance Guarantees for Regularized Maximum Entropy Density Estimation , 2004, COLT.

[45]  Ronald Rosenfeld,et al.  Semi-supervised learning with graphs , 2005 .

[46]  Roger L. Freeman  Wiley Series in Telecommunications and Signal Processing , 2005 .

[47]  Gökhan Tür,et al.  Combining active and semi-supervised learning for spoken language understanding , 2005, Speech Commun..

[48]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[49]  Thomas Hofmann,et al.  Discriminative Methods for Label Sequence Learning , 2005 .

[50]  Alexander Zien,et al.  Semi-Supervised Classification by Low Density Separation , 2005, AISTATS.

[51]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[52]  Mikhail Belkin,et al.  Beyond the point cloud: from transductive to semi-supervised learning , 2005, ICML.

[53]  Jun'ichi Tsujii,et al.  Maximum Entropy Models with Inequality Constraints: A Case Study on Text Categorization , 2005, Machine Learning.

[54]  Mikhail Belkin,et al.  Maximum Margin Semi-Supervised Learning for Structured Variables , 2005, NIPS 2005.

[55]  Jason Weston,et al.  Large Scale Transductive SVMs , 2006, J. Mach. Learn. Res..

[56]  Michael I. Jordan,et al.  Convexity, Classification, and Risk Bounds , 2006 .

[57]  Ivor W. Tsang,et al.  Large-Scale Sparsified Manifold Regularization , 2006, NIPS.

[58]  John G. van Bosse,et al.  Wiley Series in Telecommunications and Signal Processing , 2006 .

[59]  Michael P. Friedlander,et al.  On minimizing distortion and relative entropy , 2006, IEEE Transactions on Information Theory.

[60]  Luis Gómez-Chova,et al.  Urban monitoring using multi-temporal SAR and multi-spectral data , 2006, Pattern Recognit. Lett..

[61]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[62]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[63]  Miroslav Dudík,et al.  Maximum Entropy Distribution Estimation with Generalized Regularization , 2006, COLT.

[64]  Gunnar Rätsch,et al.  Large Scale Multiple Kernel Learning , 2006, J. Mach. Learn. Res..

[65]  Dale Schuurmans,et al.  Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling , 2006, ACL.

[66]  Alexander Zien,et al.  Label Propagation and Quadratic Criterion , 2006 .

[67]  Alexander J. Smola,et al.  Unifying Divergence Minimization and Statistical Inference Via Convex Duality , 2006, COLT.

[68]  Ryan M. Rifkin,et al.  Value Regularization and Fenchel Duality , 2007, J. Mach. Learn. Res..

[69]  Ben Taskar,et al.  An Introduction to Conditional Random Fields for Relational Learning , 2007 .

[70]  Ben Taskar,et al.  Introduction to statistical relational learning , 2007 .

[71]  Olaf Sporns  Editorial: Introduction to the Special Issue with Papers from the Fifth International Conference on Development and Learning (ICDL) , 2007, Adapt. Behav..

[72]  Gideon S. Mann,et al.  Simple, robust, scalable semi-supervised learning via expectation regularization , 2007, ICML '07.

[73]  Nasser M. Nasrabadi,et al.  Pattern Recognition and Machine Learning , 2006, Technometrics.

[74]  Ben Taskar,et al.  Expectation Maximization and Posterior Constraints , 2007, NIPS.

[75]  Andrew Zisserman,et al.  Advances in Neural Information Processing Systems (NIPS) , 2007 .

[76]  Miroslav Dudík  Maximum entropy density estimation and modeling geographic distributions of species , 2007 .

[77]  Ming-Wei Chang,et al.  Guiding Semi-Supervision with Constraint-Driven Learning , 2007, ACL.

[78]  A. P. Dawid,et al.  Generative or Discriminative? Getting the Best of Both Worlds , 2007 .

[79]  Gideon S. Mann,et al.  Generalized Expectation Criteria for Semi-Supervised Learning of Conditional Random Fields , 2008, ACL.

[80]  Maria-Florina Balcan,et al.  New theoretical frameworks for machine learning , 2008 .

[81]  A. Fagg,et al.  Learning Grasp Affordances Through Human Demonstration , 2008 .

[82]  Matthias W. Seeger,et al.  Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods , 2008, J. Mach. Learn. Res..

[83]  Lawson L. S. Wong,et al.  Learning Grasp Strategies with Partial Shape Information , 2008, AAAI.

[84]  Nicolas Pugeault,et al.  Early cognitive vision: feedback mechanisms for the disambiguation of early visual representation , 2008 .

[85]  Ashutosh Saxena,et al.  Robotic Grasping of Novel Objects using Vision , 2008, Int. J. Robotics Res..

[86]  Jason Weston,et al.  Large scale manifold transduction , 2008, ICML '08.

[87]  Alexander J. Smola,et al.  Distribution Matching for Transduction , 2009, NIPS.

[88]  N. Kruger,et al.  Learning object-specific grasp affordance densities , 2009, 2009 IEEE 8th International Conference on Development and Learning.

[89]  Andrew McCallum,et al.  Alternating Projections for Learning with Expectation Constraints , 2009, UAI.

[90]  Antonio Torralba,et al.  Semi-Supervised Learning in Gigantic Image Collections , 2009, NIPS.

[91]  Dan Klein,et al.  Learning from measurements in exponential families , 2009, ICML '09.

[92]  Mikhail Belkin,et al.  Semi-Supervised Learning Using Sparse Eigenfunction Bases , 2009, AAAI Fall Symposium: Manifold Learning and Its Applications.

[93]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[94]  Justus H. Piater,et al.  A Probabilistic Framework for 3D Visual Object Representation , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[95]  Maya R. Gupta,et al.  Similarity-based Classification: Concepts and Algorithms , 2009, J. Mach. Learn. Res..

[96]  Matthias Seeger,et al.  Learning from Labeled and Unlabeled Data , 2010, Encyclopedia of Machine Learning.