An annotated dataset of literary entities

We present a new dataset comprising 210,532 tokens evenly drawn from 100 different English-language literary texts and annotated for ACE entity categories (person, location, geo-political entity, facility, organization, and vehicle). These categories include non-named entities (such as “the boy”, “the kitchen”) and nested structure (such as [[the cook]’s sister]). In contrast to existing datasets built primarily on news, which focus on geo-political entities and organizations, literary texts offer strikingly different distributions of entity categories, with much stronger emphasis on people and descriptions of settings. We present empirical results on the performance of nested entity recognition models in this domain: training natively on in-domain literary data yields an improvement of over 20 absolute points in F-score (from 45.7 to 68.3), and mitigates the performance disparity between male and female entities that is present in models trained on news data.
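As a rough illustration of what nested annotation and span-level F-score involve, the sketch below encodes the example [[the cook]’s sister] as two overlapping mentions and scores a made-up system output by exact span match. The tokenization, the `PER` label string, the `Mention` type, and the `span_f1` helper are assumptions introduced here for illustration; they are not the dataset's file format or the evaluation code behind the reported numbers.

```python
from typing import NamedTuple, Set, Tuple


class Mention(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index of the last token (inclusive)
    label: str  # entity category; "PER" is an assumed tag name


# Assumed tokenization of the example phrase.
tokens = ["the", "cook", "'s", "sister"]

# Nested gold annotation: the inner mention "the cook" and the
# outer mention "the cook 's sister" are both person entities.
gold = {Mention(0, 1, "PER"), Mention(0, 3, "PER")}

# Hypothetical system output: finds the inner mention, misses the
# outer one, and proposes a spurious single-token mention.
pred = {Mention(0, 1, "PER"), Mention(3, 3, "PER")}


def span_f1(gold: Set[Mention], pred: Set[Mention]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over labeled spans."""
    tp = len(gold & pred)  # a span counts only if boundaries and label match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = span_f1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Storing mentions as a set of labeled spans, rather than one tag per token, is what lets an inner and an outer mention share the same tokens.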
