Paper information - Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis

Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis

The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.7-million-word treebank that is an important resource for research in syntactic change, has several properties that present potential challenges for NLP technologies. We describe these key features of PPCEME that make it challenging for parsing, including a larger and more varied set of function tags than in the Penn Treebank, and present results for this corpus using a modified version of the Berkeley Neural Parser and the approach to function tag recovery of Gabbard et al. (2006). While this approach to function tag recovery gives reasonable results, it is in some ways inappropriate for span-based parsers. We also present further evidence of the importance of in-domain pretraining for contextualized word representations. The resulting parser will be used to parse Early English Books Online, a 1.5-billion-word corpus whose utility for the study of syntactic change will be greatly increased with the addition of accurate parse trees.

[1] Seth Kulick, et al. Parsing Early Modern English for Linguistic Search, 2020, ArXiv.

[2] Andres Karjus. Competition, selection and communicative need in language change: an investigation using corpora, computational modelling and experimentation, 2021.

[3] Anton Karl Ingason, et al. Smooth Signals and Syntactic Change, 2021.

[4] Ion Androutsopoulos, et al. LEGAL-BERT: “Preparing the Muppets for Court”, 2020, FINDINGS.

[5] Charlotte Galves. Relaxed Verb Second in Classical Portuguese, 2020.

[6] Khalil Mrini, et al. Rethinking Self-Attention: Towards Interpretability in Neural Parsing, 2019, FINDINGS.

[7] Jaewoo Kang, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining, 2019, Bioinform.

[8] Dogu Araci, et al. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models, 2019, ArXiv.

[9] Kyle Gorman, et al. We Need to Talk about Standard Splits, 2019, ACL.

[10] William W. Cohen, et al. Probing Biomedical Embeddings from Language Models, 2019, Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for.

[11] Iz Beltagy, et al. SciBERT: Pretrained Contextualized Embeddings for Scientific Text, 2019, ArXiv.

[12] Iz Beltagy, et al. SciBERT: A Pretrained Language Model for Scientific Text, 2019, EMNLP.

[13] Dan Klein, et al. Multilingual Constituency Parsing with Self-Attention and Pre-Training, 2018, ACL.

[14] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[15] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[16] Weiwei Sun, et al. Pre- and In-Parsing Models for Neural Empty Category Detection, 2018, ACL.

[17] Dan Klein, et al. Constituency Parsing with a Self-Attentive Encoder, 2018, ACL.

[18] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.

[19] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[20] Dan Klein, et al. A Minimal Span-Based Neural Constituency Parser, 2017, ACL.

[21] Joel C. Wallenberg. Extraposition is disappearing, 2016.

[22] Yi Yang, et al. Part-of-Speech Tagging for Historical English, 2016, NAACL.