FanfictionNLP: A Text Processing Pipeline for Fanfiction

Fanfiction presents an opportunity as a data source for research in NLP, education, and social science. However, answering specific research questions with this data is difficult, since fanfiction contains more diverse writing styles than formal fiction. We present a text processing pipeline for fanfiction, with a focus on identifying text associated with characters. The pipeline includes modules for character identification and coreference, as well as the attribution of quotes and narration to those characters. Additionally, the pipeline contains a novel approach to character coreference that uses knowledge from quote attribution to resolve pronouns within quotes. For each module, we evaluate the effectiveness of various approaches on 10 annotated fanfiction stories. This pipeline outperforms tools developed for formal fiction on the tasks of character coreference and quote attribution.
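
The idea of using quote attribution to resolve pronouns within quotes can be illustrated with a minimal sketch. This is not the pipeline's implementation: the function name `resolve_quote_pronouns`, the pronoun list, and the restriction to first-person pronouns are all assumptions made for illustration. It assumes a quote has already been attributed to a speaker and links each first-person pronoun inside the quote to that speaker.

```python
import re

# Hypothetical pronoun list; a real system would likely use a richer inventory.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def resolve_quote_pronouns(quote, speaker):
    """Link first-person pronouns in an attributed quote to its speaker.

    Returns a list of (pronoun, character) pairs in order of appearance.
    """
    # Split on non-letters so contractions such as "I'm" still yield "I".
    tokens = re.findall(r"[A-Za-z]+", quote)
    return [(tok, speaker) for tok in tokens if tok.lower() in FIRST_PERSON]


# Example: the quote is assumed to have been attributed to "Ron" upstream.
links = resolve_quote_pronouns("I think my wand is broken", "Ron")
```

Here `links` pairs "I" and "my" with the attributed speaker, while third-person pronouns are left for a general coreference module to handle.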
