Scalable semi-supervised grammar induction using cross-linguistically parameterized syntactic prototypes

This thesis addresses the task of unsupervised parser induction: automatically learning grammars and parsing models from raw text. We induce such parsers by observing only sequences of terminal symbols, and we focus on overcoming the problem of frequent collocation, a major source of error in grammar induction. For example, since a verb and a determiner tend to co-occur within a verb phrase, the probability of attaching the determiner to the verb is sometimes higher than that of attaching the head noun to the verb, yielding the erroneous attachment *((Verb Det) Noun) instead of (Verb (Det Noun)). Although collocation statistics lie at the heart of grammar induction, they can badly distort the induced grammar distribution.

Natural language grammars follow a Zipfian (power-law) distribution, in which the frequency of a grammar rule is inversely proportional to its rank in the frequency table. We therefore expect that covering the most frequent grammar rules will have a strong impact on induction accuracy. We propose an efficient approach to grammar induction guided by cross-linguistic language parameters: 33 parameters describing frequent basic word orders, which are easy to elicit from grammar compendiums or from short interviews with naïve language informants. These parameters are designed to capture the frequent word orders at the head of the Zipfian distribution, while the rest of the grammar, including exceptions, is induced automatically from unlabeled data. The language parameters shrink the search space of the grammar induction problem by exploiting both word-order information and predefined attachment directions.

The contribution of this thesis is three-fold. (1) We show that the language parameters generalize adequately across languages: our grammar induction experiments cover 14 languages on top of a simple unsupervised grammar induction system. (2) Our specification of language parameters improves the accuracy of unsupervised parsing even on longer sentences, where the parser is exposed to much less frequent linguistic phenomena, with accuracy degrading by less than 10%. (3) We investigate the prevalent sources of error in grammar induction, identifying room for further accuracy improvement.
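The attachment error described above can be made concrete with a toy sketch. All counts and words here are invented for illustration and do not come from the thesis; the point is only that raw co-occurrence frequency can prefer the wrong tree.

```python
# Toy illustration (invented counts): why raw co-occurrence statistics
# can favor the erroneous attachment *((Verb Det) Noun).
from collections import Counter

# Hypothetical corpus counts of (head, dependent) word pairs.
pair_counts = Counter({
    ("eat", "the"): 500,    # verb and determiner co-occur very often
    ("eat", "apple"): 40,   # verb and its true noun argument are rarer
    ("apple", "the"): 60,   # determiner modifying its noun
})
total = sum(pair_counts.values())

def attach_prob(head, dep):
    """Relative frequency of dep attaching to head."""
    return pair_counts[(head, dep)] / total

# Wrong parse ((eat the) apple): "the" attaches to the verb "eat".
wrong = attach_prob("eat", "the") * attach_prob("eat", "apple")
# Right parse (eat (the apple)): "the" attaches to the noun "apple".
right = attach_prob("apple", "the") * attach_prob("eat", "apple")

assert wrong > right  # frequency alone prefers the wrong tree
```

A predefined attachment direction (determiner attaches to noun) of the kind encoded by the language parameters rules out the high-frequency but wrong analysis before any statistics are consulted.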
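The claim that covering the most frequent rules has a strong impact can be sketched numerically. Assuming an idealized Zipfian distribution f(r) ∝ 1/r over grammar rules (the rule inventory size below is an arbitrary assumption, not thesis data), a handful of top-ranked rules already holds a large share of the probability mass:

```python
# Sketch under an idealized Zipf law f(r) ∝ 1/r over grammar rules.
def zipf_coverage(top_k, n_rules):
    """Fraction of total probability mass held by the top_k ranked rules."""
    weights = [1.0 / r for r in range(1, n_rules + 1)]
    return sum(weights[:top_k]) / sum(weights)

# With a hypothetical inventory of 10,000 rules, the 33 top-ranked
# rules (matching the number of language parameters) cover a large
# share of all rule occurrences.
coverage = zipf_coverage(33, 10_000)
print(f"top-33 coverage: {coverage:.1%}")
```

Under this idealization, the coverage of the top 33 rules is roughly 40%, which is why hand-specifying only the most frequent word orders, and leaving the long tail to unsupervised learning, is a plausible division of labor.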
