Optimal Code Length Based Cost for Unsupervised Grammar Induction

An effective grammar can be induced from natural language sentences by simultaneously minimizing the cost of encoding the grammar and the cost of encoding the corresponding derivations of the sentences. In previous work, the cost of encoding a derivation was computed as the number of bits required to encode which of the possible productions is used to expand each of its non-terminals. However, this ignores the fact that if some productions are used more often than others in the derivations, they can be encoded with fewer bits using an optimal code length based encoding. This paper presents a new derivation cost that uses such an encoding and applies it to grammar induction. Minimizing this new derivation cost also corresponds to maximizing the probability of the derivations. Thus, besides being theoretically more appealing, this new derivation cost also leads to the induction of grammars with better parsing performance, as shown by experimental results on sentences from clinical reports.
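The difference between the two derivation costs can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy grammar, and the assumption that production probabilities are estimated as relative frequencies per non-terminal are all illustrative choices. Under that assumption, the optimal code length for a production used with relative frequency p is -log2(p) bits, so the total cost equals the negative log-probability of the derivations, which is why minimizing it corresponds to maximizing their probability.

```python
import math
from collections import Counter

def uniform_cost(grammar, uses):
    """Earlier-style cost: each expansion of a non-terminal costs
    log2(number of productions available for that non-terminal) bits.
    `grammar` maps a non-terminal to its list of productions (hypothetical layout).
    `uses` is a list of (non_terminal, production) pairs, one per expansion."""
    return sum(math.log2(len(grammar[nt])) for nt, _ in uses)

def optimal_code_cost(uses):
    """Optimal code length cost: a production used with relative frequency p
    (among expansions of the same non-terminal) costs -log2(p) bits per use.
    Frequencies are estimated from `uses` itself (an assumption of this sketch)."""
    production_counts = Counter(uses)
    nonterminal_counts = Counter(nt for nt, _ in uses)
    return -sum(
        count * math.log2(count / nonterminal_counts[nt])
        for (nt, _), count in production_counts.items()
    )

# Toy example: NP has two productions, one used three times and one used once.
toy_grammar = {"NP": ["Det N", "N"]}
toy_uses = [("NP", "Det N")] * 3 + [("NP", "N")]

print(uniform_cost(toy_grammar, toy_uses))   # 4 expansions at 1 bit each
print(optimal_code_cost(toy_uses))           # fewer bits, since usage is skewed
```

When all productions of a non-terminal are used equally often, the two costs coincide; the optimal code length cost is lower only when usage is skewed.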
