Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar

Scaling wide-coverage, constraint-based grammars such as Lexical-Functional Grammars (LFG) (Kaplan and Bresnan, 1982; Bresnan, 2001) or Head-Driven Phrase Structure Grammars (HPSG) (Pollard and Sag, 1994) from fragments to naturally occurring, unrestricted text is knowledge-intensive, time-consuming and (often prohibitively) expensive. A number of researchers have recently presented methods to automatically acquire wide-coverage, probabilistic constraint-based grammatical resources from treebanks (Cahill et al., 2002; Cahill et al., 2003; Cahill et al., 2004; Miyao et al., 2003; Miyao et al., 2004; Hockenmaier and Steedman, 2002; Hockenmaier, 2003), addressing the knowledge acquisition bottleneck in constraint-based grammar development. Research to date has concentrated on English and German. In this paper we report on an experiment to induce wide-coverage, probabilistic LFG grammatical and lexical resources for Chinese from the Penn Chinese Treebank (CTB) (Xue et al., 2002), based on an automatic f-structure annotation algorithm. Currently 96.751% of the CTB trees receive a single, covering and connected f-structure, 0.112% do not receive an f-structure due to feature clashes, and 3.137% are associated with multiple f-structure fragments. From the f-structure-annotated CTB we extract a total of 12975 lexical entries with 20 distinct subcategorisation frame types. Of these, 3436 are verbal entries with a total of 11 different frame types. We also extract a number of PCFG-based LFG approximations.
Currently our best automatically induced grammars achieve an f-score of 81.57% against the trees in unseen articles 301-325. Against the dependencies derived from the f-structures automatically generated for the original trees in articles 301-325, they achieve f-scores of 86.06% (all grammatical functions) and 73.98% (preds-only). Against the dependencies derived from the manually annotated gold-standard f-structures for 50 trees randomly selected from articles 301-325, they achieve f-scores of 82.79% (all grammatical functions) and 67.74% (preds-only).
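
To make the dependency-based scores concrete, the following is a minimal sketch of how precision, recall and f-score over dependency triples might be computed, with a "preds-only" variant that ignores atomic-valued features. It is not the evaluation software used in the paper. The triple format `(relation, head, dependent)`, the `ATOMIC_FEATURES` set and the toy triples are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical set of atomic-valued features that a preds-only
# evaluation would ignore (assumption, not taken from the paper).
ATOMIC_FEATURES = {"tense", "num", "pers"}


def prf(gold, test):
    """Return (precision, recall, f-score) over two lists of triples."""
    gold_counts, test_counts = Counter(gold), Counter(test)
    # Multiset intersection: each triple matches at most as often
    # as it occurs in both lists.
    matched = sum((gold_counts & test_counts).values())
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def preds_only(triples):
    """Keep only triples whose relation is not an atomic feature."""
    return [t for t in triples if t[0] not in ATOMIC_FEATURES]


# Toy triples (relation, head, dependent); purely illustrative.
gold = [("subj", "like", "john"), ("obj", "like", "mary"), ("tense", "like", "past")]
test = [("subj", "like", "john"), ("obj", "like", "bill"), ("tense", "like", "past")]

_, _, f_all = prf(gold, test)
_, _, f_preds = prf(preds_only(gold), preds_only(test))
print(f"all grammatical functions: {f_all:.4f}")
print(f"preds-only:                {f_preds:.4f}")
```

In this toy case the preds-only score is lower than the all-features score, because the correctly recovered atomic feature no longer counts towards the match total. This is in line with the pattern of results reported above.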

[1]  Her One-Soon Chinese Inversion Constructions within a Simplified LMT , 2003 .

[2]  Nianwen Xue,et al.  Building a Large-Scale Annotated Chinese Corpus , 2002, COLING.

[3]  Andy Way,et al.  Data-oriented parsing and the Penn Chinese treebank , 2004 .

[4]  Ivan A. Sag,et al.  Book Reviews: Head-driven Phrase Structure Grammar and German in Head-driven Phrase-structure Grammar , 1996, CL.

[5]  J. Bresnan Lexical-Functional Syntax , 2000 .

[6]  Adams Bodomo,et al.  Double Object and Serial Verb Benefactive Constructions in Cantonese , 2004 .

[7]  Sabine Brants,et al.  The TIGER Treebank , 2001 .

[8]  Mark Steedman,et al.  Generative Models for Statistical Parsing with Combinatory Categorial Grammar , 2002, ACL.

[9]  M. Baltin,et al.  The Mental representation of grammatical relations , 1985 .

[10]  Julia Hockenmaier Parsing with Generative Models of Predicate-Argument Structure , 2003, ACL.

[11]  Andy Way,et al.  Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank , 2004, ACL.

[12]  Andy Way,et al.  Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations , 2004, ACL.

[13]  Andy Way,et al.  Treebank-based multilingual unification-grammar development , 2003 .

[14]  Stefan Riezler,et al.  A Comparison of Evaluation Metrics for a Broad-Coverage Stochastic Parser , 2003 .

[15]  Kang-Kwong Luke Lexical-functional Grammar: Analysis of Chinese , 2003 .

[16]  Helmut Schmid Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors , 2004, COLING.

[17]  Roger Levy,et al.  Is it Harder to Parse Chinese, or the Chinese Treebank? , 2003, ACL.

[18]  Josef van Genabith,et al.  LFG for Chinese : Issues of Representation and Computation , 2001 .

[19]  Eugene Charniak,et al.  Tree-Bank Grammars , 1996, AAAI/IAAI, Vol. 2.

[20]  Jun'ichi Tsujii,et al.  Corpus-Oriented Grammar Development for Acquiring a Head-Driven Phrase Structure Grammar from the Penn Treebank , 2004, IJCNLP.

[21]  Ann Bies,et al.  The Penn Treebank: Annotating Predicate Argument Structure , 1994, HLT.

[22]  David Chiang,et al.  Recovering Latent Information in Treebanks , 2002, COLING.

[23]  Mark Johnson,et al.  Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques , 2002, ACL.

[24]  Mary Dalrymple,et al.  The PARC 700 Dependency Bank , 2003, LINC@EACL.

[25]  Yusuke Miyao,et al.  Probabilistic modeling of argument structures including non-local dependencies , 2003 .