Paper Information - Semi-Supervised Learning for Natural Language

Semi-Supervised Learning for Natural Language

Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available "for free" in large quantities. Unlabeled data has shown promise in improving the performance of a number of tasks, e.g., word sense disambiguation, information extraction, and natural language parsing. In this thesis, we focus on two segmentation tasks: named-entity recognition and Chinese word segmentation. The goal of named-entity recognition is to detect and classify names of people, organizations, and locations in a sentence. The goal of Chinese word segmentation is to find the word boundaries in a sentence that has been written as a string of characters without spaces. Our approach is as follows: in a preprocessing step, we use raw text to cluster words and calculate mutual information statistics. The output of this step is then used as features in a supervised model, specifically a global linear model trained using the Perceptron algorithm. We also compare Markov and semi-Markov models on the two segmentation tasks. Our results show that features derived from unlabeled data substantially improve performance, both in terms of reducing the amount of labeled data needed to achieve a certain performance level and in terms of reducing the error using a fixed amount of labeled data. We find that semi-Markov models can sometimes also improve performance over Markov models.

Thesis Supervisor: Michael Collins
Title: Assistant Professor, CSAIL

Percy Liang
