Razmecheno: Named Entity Recognition from Digital Archive of Diaries “Prozhito”

Timofey Atnashev♡, Veronika Ganeeva♡, Roman Kazakov♡, Daria Matyash♡$, Michael Sonkin♡, Ekaterina Voloshina♡, Oleg Serikov♡♢‡♯, Ekaterina Artemova♡†♠

♡ HSE University
♢ DeepPavlov lab, MIPT
‡ AIRI
♯ The Institute of Linguistics RAS
† Huawei Noah’s Ark Lab
♠ Lomonosov Moscow State University
$ Sber AI Centre

{taatnashev, vaganeeva, rmkazakov, dsmatyash, mvsonkin, eyuvoloshina}@edu.hse.ru
{oserikov, elartemova}@hse.ru
Moscow, Russia

Abstract

The vast majority of existing datasets for Named Entity Recognition (NER) are built primarily on news, research papers and Wikipedia, with a few exceptions created from historical and literary texts. Moreover, English is the main source of data for further labelling. This paper aims to fill these gaps by creating a novel dataset, “Razmecheno”, gathered from the Russian-language diary texts of the project “Prozhito”. Our dataset is of interest for several lines of research: literary studies of diary texts, transfer learning from other domains, and low-resource or cross-lingual named entity recognition. Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries written during Perestroika. The annotation schema consists of five commonly used entity tags: person, characteristics, location, organisation, and facility. The labelling was carried out on the crowdsourcing platform Yandex.Toloka in two stages. First, workers selected sentences that contain an entity of a particular type. Second, they marked up entity spans. As a result, 1113 entities were obtained. Empirical evaluation of Razmecheno is carried out with off-the-shelf NER tools and by fine-tuning pre-trained contextualized encoders. We release the annotated dataset for open access.
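To make the annotation schema concrete, the following minimal sketch shows one way span-level annotations over the five entity types could be turned into per-token labels for NER training. The BIO encoding, the short tag names (PER, CHAR, LOC, ORG, FAC), the function name `spans_to_bio`, and the example sentence are all assumptions made for illustration; they are not taken from the released dataset or the authors' code.

```python
from typing import List, Tuple

# Short names for the five entity types listed in the abstract.
# The abbreviations themselves are an assumption, not the dataset's labels.
ENTITY_TAGS = {"PER", "CHAR", "LOC", "ORG", "FAC"}

# A span is (start_token, end_token_exclusive, tag).
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into BIO token labels."""
    labels = ["O"] * n_tokens
    for start, end, tag in spans:
        if tag not in ENTITY_TAGS:
            raise ValueError(f"unknown tag: {tag}")
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span out of range: {(start, end)}")
        if any(label != "O" for label in labels[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        labels[start] = f"B-{tag}"          # first token of the entity
        for i in range(start + 1, end):     # remaining tokens of the entity
            labels[i] = f"I-{tag}"
    return labels


# Hypothetical sentence, not drawn from Razmecheno.
tokens = ["Anna", "Petrovna", "visited", "Moscow", "yesterday"]
labels = spans_to_bio(len(tokens), [(0, 2, "PER"), (3, 4, "LOC")])
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A flat BIO encoding like this cannot represent nested or overlapping entities, so the sketch rejects overlapping spans rather than silently overwriting labels.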
