A Challenge Set for Advancing Language Modeling

In this paper, we describe a new, publicly available corpus intended to stimulate research into language modeling techniques which are sensitive to overall sentence coherence. The task uses the Scholastic Aptitude Test's sentence completion format. The test set consists of 1040 sentences, each of which is missing a content word. The goal is to select the correct replacement from amongst five alternates. In general, all of the options are syntactically valid, and reasonable with respect to local N-gram statistics. The set was generated by using an N-gram language model to generate a long list of likely words, given the immediate context. These options were then hand-groomed, to identify four decoys which are globally incoherent, yet syntactically correct. To ensure the right to public distribution, all the data is derived from out-of-copyright materials from Project Gutenberg. The test sentences were derived from five of Conan Doyle's Sherlock Holmes novels, and we provide a large set of Nineteenth and early Twentieth Century texts as training material.

[1]  Stanley F. Chen,et al.  Performance Prediction for Exponential Language Models , 2009, NAACL.

[2]  Ronald Rosenfeld,et al.  Whole-sentence exponential language models: a vehicle for linguistic-statistical integration , 2001, Comput. Speech Lang..

[3]  Stanley F. Chen,et al.  Shrinking Exponential Language Models , 2009, NAACL.

[4]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[5]  Peder A. Olsen,et al.  Theory and practice of acoustic confusability , 2002, Comput. Speech Lang..

[6]  Deniz Yuret,et al.  KU: Word Sense Disambiguation by Substitution , 2007, Fourth International Workshop on Semantic Evaluations (SemEval-2007).

[7]  Roger K. Moore Computer Speech and Language , 1986 .

[8]  Frederick Jelinek,et al.  Exploiting Syntactic Structure for Language Modeling , 1998, ACL.

[9]  Carlo Strapparava,et al.  FBK-irst: Lexical Substitution Task Exploiting Domain and Syntagmatic Coherence , 2007, SemEval@ACL.

[10]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[11]  Stan Szpakowicz,et al.  Roget's thesaurus and semantic similarity , 2012, RANLP.

[12]  Lukás Burget,et al.  Empirical Evaluation and Combination of Advanced Language Modeling Techniques , 2011, INTERSPEECH.

[13]  Holger Schwenk,et al.  Continuous space language models , 2007, Comput. Speech Lang..

[14]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[16]  Roberto Navigli,et al.  SemEval-2007 Task 10: English Lexical Substitution Task , 2007, Fourth International Workshop on Semantic Evaluations (SemEval-2007).

[17]  F ChenStanley,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[18]  Peter D. Turney Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL , 2001, ECML.

[19]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[20]  Charles L. A. Clarke,et al.  Frequency Estimates for Statistical Word Similarity Measures , 2003, NAACL.

[21]  Jun Wu,et al.  Maximum entropy techniques for exploiting syntactic, semantic and collocational dependencies in language modeling , 2000, Comput. Speech Lang..

[22]  Mirella Lapata,et al.  Dependency-Based Construction of Semantic Space Models , 2007, CL.

[23]  Roni Rosenfeld,et al.  A whole sentence maximum entropy language model , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[24]  Frederick Jelinek,et al.  Structured language modeling , 2000, Comput. Speech Lang..

[25]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[26]  Jun Wu,et al.  Combining nonlocal, syntactic and n-gram dependencies in language modeling , 1999, EUROSPEECH.

[27]  Cecil R. Reynolds,et al.  Educational Testing Service , 2008 .

[28]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[29]  Jean-Luc Gauvain,et al.  Connectionist language modeling for large vocabulary continuous speech recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[30]  Kevin Knight,et al.  A Syntax-based Statistical Translation Model , 2001, ACL.

[31]  Rada Mihalcea,et al.  UNT: SubFinder: Combining Knowledge Sources for Automatic Lexical Substitution , 2007, Fourth International Workshop on Semantic Evaluations (SemEval-2007).

[32]  Leonard S. Cahen,et al.  Educational Testing Service , 1970 .