Semi-Supervised Learning for Natural Language

Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available "for free" in large quantities. Unlabeled data has shown promise in improving the performance of a number of tasks, e.g. word sense disambiguation, information extraction, and natural language parsing. In this thesis, we focus on two segmentation tasks, named-entity recognition and Chinese word segmentation. The goal of named-entity recognition is to detect and classify names of people, organizations, and locations in a sentence. The goal of Chinese word segmentation is to find the word boundaries in a sentence that has been written as a string of characters without spaces. Our approach is as follows: In a preprocessing step, we use raw text to cluster words and calculate mutual information statistics. The output of this step is then used as features in a supervised model, specifically a global linear model trained using the Perceptron algorithm. We also compare Markov and semi-Markov models on the two segmentation tasks. Our results show that features derived from unlabeled data substantially improves performance, both in terms of reducing the amount of labeled data needed to achieve a certain performance level and in terms of reducing the error using a fixed amount of labeled data. We find that sometimes semi-Markov models can also improve performance over Markov models. Thesis Supervisor: Michael Collins Title: Assistant Professor, CSAIL

[1]  R. L. Bradshaw,et al.  RESULTS AND ANALYSIS. , 1971 .

[2]  R. Sproat A statistical method for finding word boundaries in Chinese text , 1990 .

[3]  Fernando Pereira,et al.  Inside-Outside Reestimation From Partially Bracketed Corpora , 1992, HLT.

[4]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[5]  Andreas Stolcke,et al.  Hidden Markov Model} Induction by Bayesian Model Merging , 1992, NIPS.

[6]  Bernard Mérialdo,et al.  Tagging English Text with a Probabilistic Model , 1994, CL.

[7]  Adwait Ratnaparkhi,et al.  A maximum entropy model for parsing , 1994, ICSLP.

[8]  Douglas E. Appelt,et al.  SRI International FASTUS SystemMUC-6 Test Results and Analysis , 1995, MUC.

[9]  Hermann Ney,et al.  Algorithms for bigram and trigram word clustering , 1995, Speech Commun..

[10]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[11]  Carl de Marcken,et al.  The Unsupervised Acquisition of a Lexicon from Continuous Speech , 1995, ArXiv.

[12]  Beth M. Sundheim,et al.  Overview of Results of the MUC-6 Evaluation , 1995, MUC.

[13]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  John D. Lafferty,et al.  Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[15]  Maosong Sun,et al.  Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data , 1998, ACL.

[16]  Nancy Chinchor,et al.  Overview of MUC-7 , 1998, MUC.

[17]  Yoav Freund,et al.  Large Margin Classification Using the Perceptron Algorithm , 1998, COLT' 98.

[18]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[19]  J. C. BurgesChristopher A Tutorial on Support Vector Machines for Pattern Recognition , 1998 .

[20]  L. Dekang,et al.  Extracting collocations from text corpora , 1998 .

[21]  Joe F. Zhou,et al.  Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, : 21-22 June 1999, University of Maryland, College Park, MD, USA , 1999 .

[22]  Ralph Grishman,et al.  A Maximum Entropy Approach to Named Entity Recognition , 1999 .

[23]  Yoram Singer,et al.  Unsupervised Models for Named Entity Classification , 1999, EMNLP.

[24]  Ellen Riloff,et al.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping , 1999, AAAI/IAAI.

[25]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[26]  Matthew Brand,et al.  Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter Extinction , 1999, Neural Computation.

[27]  Jianfeng Gao,et al.  Extraction of Chinese Compound Words - An Experimental Study on a Very Large Corpus , 2000, ACL 2000.

[28]  Andi Wu,et al.  Statistically-Enhanced New Word Identification in a Rule-Based Chinese System , 2000, ACL 2000.

[29]  Rayid Ghani,et al.  Analyzing the effectiveness and applicability of co-training , 2000, CIKM '00.

[30]  Yan Zhou,et al.  Enhancing Supervised Learning with Unlabeled Data , 2000, ICML.

[31]  Avrim Blum,et al.  Learning from Labeled and Unlabeled Data using Graph Mincuts , 2001, ICML.

[32]  Richard Sproat,et al.  Corpus-Based Methods in Chinese Morphology and Phonology , 2001 .

[33]  Michael Collins,et al.  Parameter Estimation for Statistical Parsing Models: Theory and Practice of , 2001, IWPT.

[34]  Dale Schuurmans,et al.  Self-Supervised Chinese Word Segmentation , 2001, IDA.

[35]  Tommi S. Jaakkola,et al.  Partially labeled classification with Markov random walks , 2001, NIPS.

[36]  Dale Schuurmans,et al.  A Hierarchical EM Approach to Word Segmentation , 2001, NLPRS.

[37]  Sanjoy Dasgupta,et al.  PAC Generalization Bounds for Co-training , 2001, NIPS.

[38]  Claire Cardie,et al.  Limitations of Co-Training for Natural Language Learning from Large Datasets , 2001, EMNLP.

[39]  Michael Collins,et al.  Convolution Kernels for Natural Language , 2001, NIPS.

[40]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[41]  Min Tang,et al.  Active Learning for Statistical Natural Language Parsing , 2002, ACL.

[42]  Zoubin Ghahramani,et al.  Learning from labeled and unlabeled data with label propagation , 2002 .

[43]  Michael Collins,et al.  Ranking Algorithms for Named Entity Extraction: Boosting and the VotedPerceptron , 2002, ACL.

[44]  Jian Su,et al.  Named Entity Recognition using an HMM-based Chunk Tagger , 2002, ACL.

[45]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[46]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[47]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[48]  Andrew McCallum,et al.  Efficiently Inducing Features of Conditional Random Fields , 2002, UAI.

[49]  Thorsten Joachims,et al.  Transductive Learning via Spectral Graph Partitioning , 2003, ICML.

[50]  Thomas Hofmann,et al.  Hidden Markov Support Vector Machines , 2003, ICML.

[51]  Richard Sproat,et al.  The First International Chinese Word Segmentation Bakeoff , 2003, SIGHAN.

[52]  Mark Steedman,et al.  Example Selection for Bootstrapping Statistical Parsers , 2003, NAACL.

[53]  W. Bruce Croft,et al.  Table extraction using conditional random fields , 2003, DG.O.

[54]  Nianwen Xu,et al.  Chinese Word Segmentation as Character Tagging , 2003, Int. J. Comput. Linguistics Chin. Lang. Process..

[55]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[56]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[57]  Changning Huang,et al.  Improved Source-Channel Models for Chinese Word Segmentation , 2003, ACL.

[58]  Xiaojin Zhu,et al.  Kernel conditional random fields: representation and clique selection , 2004, ICML.

[59]  Changning Huang,et al.  Chinese Word Segmentation: A Pragmatic Approach , 2004 .

[60]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[61]  Ben Taskar,et al.  Max-Margin Parsing , 2004, EMNLP.

[62]  William W. Cohen,et al.  Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods , 2004, KDD.

[63]  Brian Roark,et al.  Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm , 2004, ACL.

[64]  Rebecca Hwa,et al.  Sample Selection for Statistical Parsing , 2004, CL.

[65]  Trevor Darrell,et al.  Conditional Random Fields for Object Recognition , 2004, NIPS.

[66]  Richard M. Schwartz,et al.  An Algorithm that Learns What's in a Name , 1999, Machine Learning.

[67]  Andrew McCallum,et al.  Chinese Segmentation and New Word Detection using Conditional Random Fields , 2004, COLING.

[68]  Fernando Pereira,et al.  Case-factor diagrams for structured probabilistic modeling , 2004, J. Comput. Syst. Sci..

[69]  Scott Miller,et al.  Name Tagging with Word Clusters and Discriminative Training , 2004, NAACL.

[70]  Steven P. Abney Understanding the Yarowsky Algorithm , 2004, CL.

[71]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[72]  Andrew Y. Ng,et al.  Learning random walk models for inducing word dependency distributions , 2004, ICML.

[73]  Zhongmin Shi,et al.  Intimate Learning: A Novel Approach for Combining Labelled and Unlabelled Data , 2005, IJCAI.

[74]  Michael Collins,et al.  Discriminative Reranking for Natural Language Parsing , 2000, CL.

[75]  Wei Li,et al.  Semi-Supervised Sequence Modeling with Syntactic Topic Models , 2005, AAAI.

[76]  Maosong Sun,et al.  Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data , 2022, International Conference on Computational Linguistics.