Variance of Average Surprisal: A Better Predictor for Quality of Grammar from Unsupervised PCFG Induction

In unsupervised grammar induction, data likelihood is known to be only weakly correlated with parsing accuracy, especially at convergence across multiple runs. To find a better indicator of the quality of induced grammars, this paper correlates several linguistically and psycholinguistically motivated predictors with parsing accuracy on a large multilingual grammar induction evaluation data set. Results show that variance of average surprisal (VAS) correlates better with parsing accuracy than data likelihood does, and that using VAS instead of data likelihood for model selection provides a significant accuracy boost. Further evidence shows that VAS is also a better candidate than data likelihood for predicting word order typology classification. Analyses show that VAS seems to separate content words from function words in natural language grammars and to arrange words of different frequencies into classes that are more consistent with linguistic theory.
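
The abstract does not spell out how VAS is computed. The sketch below is a minimal illustration under the assumption that "average surprisal" means the per-word negative log probability a sentence receives from the induced PCFG, with the variance then taken across the sentences of the corpus; the function names and argument layout are hypothetical, not the paper's implementation.

```python
from statistics import variance

def average_surprisal(log_prob, length):
    """Per-word surprisal of one sentence: -log P(sentence) / sentence length."""
    return -log_prob / length

def vas(sentence_log_probs, sentence_lengths):
    """Variance of average surprisal (VAS) across a corpus.

    sentence_log_probs: log probabilities assigned to each sentence by an induced PCFG
    sentence_lengths:   token counts of the corresponding sentences
    """
    avg = [average_surprisal(lp, n)
           for lp, n in zip(sentence_log_probs, sentence_lengths)]
    return variance(avg)
```

For model selection, this quantity would presumably be compared across converged runs in place of data likelihood; the abstract does not state whether higher or lower VAS is preferred, so that choice is left open here.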
