Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis

The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.7-million-word treebank that is an important resource for research in syntactic change, has several properties that present potential challenges for NLP technologies. We describe the key features of PPCEME that make it challenging to parse, including a larger and more varied set of function tags than in the Penn Treebank, and present results for this corpus using a modified version of the Berkeley Neural Parser and the approach to function tag recovery of Gabbard et al. (2006). While this approach to function tag recovery gives reasonable results, it is in some ways inappropriate for span-based parsers. We also present further evidence of the importance of in-domain pretraining for contextualized word representations. The resulting parser will be used to parse Early English Books Online, a 1.5-billion-word corpus whose utility for the study of syntactic change will be greatly increased by the addition of accurate parse trees.

[1]  Seth Kulick, et al. Parsing Early Modern English for Linguistic Search, 2020, ArXiv.

[2]  Andres Karjus. Competition, selection and communicative need in language change: an investigation using corpora, computational modelling and experimentation, 2021.

[3]  Anton Karl Ingason, et al. Smooth Signals and Syntactic Change, 2021.

[4]  Ion Androutsopoulos, et al. LEGAL-BERT: “Preparing the Muppets for Court”, 2020, FINDINGS.

[5]  Charlotte Galves. Relaxed Verb Second in Classical Portuguese, 2020.

[6]  Khalil Mrini, et al. Rethinking Self-Attention: Towards Interpretability in Neural Parsing, 2019, FINDINGS.

[7]  Jaewoo Kang, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining, 2019, Bioinform.

[8]  Dogu Araci, et al. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models, 2019, ArXiv.

[9]  Kyle Gorman, et al. We Need to Talk about Standard Splits, 2019, ACL.

[10]  William W. Cohen, et al. Probing Biomedical Embeddings from Language Models, 2019, Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for.

[11]  Iz Beltagy, et al. SciBERT: Pretrained Contextualized Embeddings for Scientific Text, 2019, ArXiv.

[12]  Iz Beltagy, et al. SciBERT: A Pretrained Language Model for Scientific Text, 2019, EMNLP.

[13]  Dan Klein, et al. Multilingual Constituency Parsing with Self-Attention and Pre-Training, 2018, ACL.

[14]  Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[15]  Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[16]  Weiwei Sun, et al. Pre- and In-Parsing Models for Neural Empty Category Detection, 2018, ACL.

[17]  Dan Klein, et al. Constituency Parsing with a Self-Attentive Encoder, 2018, ACL.

[18]  Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.

[19]  Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[20]  Dan Klein, et al. A Minimal Span-Based Neural Constituency Parser, 2017, ACL.

[21]  Joel C. Wallenberg. Extraposition is disappearing, 2016.

[22]  Yi Yang, et al. Part-of-Speech Tagging for Historical English, 2016, NAACL.

[23]  Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Aaron Ecay. A multi-step analysis of the evolution of English do-support, 2015.

[25]  Reut Tsarfaty, et al. Introducing the SPMRL 2014 Shared Task on Parsing Morphologically-rich Languages, 2014.

[26]  Seth Kulick, et al. The Penn Parsed Corpus of Modern British English: First Parsing Results and Analysis, 2014, ACL.

[27]  Yoav Goldberg, et al. Language-Independent Parsing with Empty Elements, 2011, ACL.

[28]  昌明 神谷, et al. Small Clauses and Resultative Constructions in Early Modern English: Searching the Penn-Helsinki Parsed Corpus of Early Modern English, 2011.

[29]  Beatrice Santorini, et al. Penn parsed corpora of historical English, 2011.

[30]  Charlotte Galves, et al. Computational and linguistic aspects of the construction of the Tycho Brahe Parsed Corpus of Historical Portuguese, 2008.

[31]  神谷 昌明, et al. Small Clauses and Resultative Constructions in Old English: Searching the York-Toronto-Helsinki Parsed Corpus of Old English Prose, 2008.

[32]  Jason Baldridge, et al. Part-of-Speech Tagging for Middle English through Alignment and Projection of Parallel Diachronic Texts, 2007, EMNLP-CoNLL.

[33]  Seth Kulick, et al. Fully Parsing the Penn Treebank, 2006, NAACL.

[34]  Gabriele Musillo, et al. Accurate Function Parsing, 2005, HLT/EMNLP.

[35]  Eugene Charniak, et al. Function tagging, 2004.

[36]  Mark Johnson, et al. A Simple Pattern-matching Algorithm for Recovering Empty Nodes and their Antecedents, 2002, ACL.

[37]  A. Kroch, et al. The Middle English Verb-Second Constraint: A case study in language contact and language change, 2001.

[38]  Eugene Charniak, et al. Assigning Function Tags to Parsed Text, 2000, ANLP.

[39]  D. Wing, et al. Early English books online, 1999.

[40]  Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[41]  A. Kroch. Reflexes of grammar in patterns of language change, 1989, Language Variation and Change.