Unsupervised Induction of Labeled Parse Trees by Clustering with Syntactic Features

We present an algorithm for unsupervised induction of labeled parse trees. The algorithm has three stages: bracketing, initial labeling, and label clustering. Bracketing is performed on raw text by an unsupervised incremental parser. Initial labeling uses a merging model that aims to minimize the grammar description length. Finally, the initial labels are clustered into a desired number of labels using syntactic features extracted from the initially labeled trees. The algorithm obtains a 59% labeled F-score on the WSJ10 corpus, compared to 35% in previous work, and a substantial error reduction over a random baseline. We report results for English, German, and Chinese corpora, using two label mapping methods and two label set sizes.
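Because induced labels are arbitrary cluster identifiers, a labeled F-score can only be computed after the induced labels are mapped onto the gold label set. The abstract does not say which two mapping methods are used, so the following is only a minimal sketch under stated assumptions: it assumes a greedy many-to-one mapping (each induced label is assigned the gold label it co-occurs with most often) and represents a tree as a dictionary from spans to labels. The function names, the span representation, and the toy data are all hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def many_to_one_mapping(induced, gold):
    """Map each induced label to the gold label it most often shares a span with.

    `induced` and `gold` are dicts from a span (sentence_id, start, end) to a label.
    This greedy many-to-one scheme is an assumption made for illustration.
    """
    cooccurrence = defaultdict(Counter)
    for span, induced_label in induced.items():
        if span in gold:
            cooccurrence[induced_label][gold[span]] += 1
    return {
        induced_label: counts.most_common(1)[0][0]
        for induced_label, counts in cooccurrence.items()
    }


def labeled_f_score(induced, gold, mapping):
    """Harmonic mean of labeled precision and recall under a given label mapping."""
    correct = sum(
        1
        for span, induced_label in induced.items()
        if span in gold and mapping.get(induced_label) == gold[span]
    )
    precision = correct / len(induced) if induced else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): two short sentences with gold and induced brackets.
gold = {
    (0, 0, 2): "NP", (0, 2, 5): "VP", (0, 0, 5): "S",
    (1, 0, 3): "NP", (1, 0, 6): "S",
}
induced = {
    (0, 0, 2): "c1", (0, 2, 5): "c2", (0, 0, 5): "c3",
    (1, 0, 3): "c1", (1, 3, 6): "c2",
}

mapping = many_to_one_mapping(induced, gold)
score = labeled_f_score(induced, gold, mapping)
```

In this toy case four of the five induced brackets match a gold bracket with the correct mapped label, so precision and recall are both 4/5. A one-to-one mapping would instead require solving an assignment problem between the two label sets and would generally give a score no higher than the many-to-one one.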
