An analysis of a full-size Russian complexly NER-labelled corpus of Internet user reviews on drugs, based on deep learning and language neural nets

We present a full-size Russian corpus of Internet user reviews with complex NER labeling, along with an evaluation of the accuracy reached on this corpus by a set of advanced deep learning neural networks that extract pharmacologically meaningful entities from Russian texts. The corpus annotation includes mentions of the following entities: Medication (33005 mentions), Adverse Drug Reaction (1778), Disease (17403), and Note (4490). Two of them, Medication and Disease, carry a set of attributes. A part of the corpus has coreference annotation, with 1560 coreference chains in 300 documents. A dedicated multi-label model, based on a language model and a set of features and suited to the labeling of the presented corpus, is developed. We analyze the influence of different modifications of the models: word vector representations, types of language models pre-trained for Russian, text normalization styles, and other preliminary processing. The size of the corpus is sufficient to study the effects of the particularities of corpus labeling and of balancing entities in the corpus. As a result, the state of the art for the pharmacological entity extraction problem for Russian is established on a full-size labeled corpus. For adverse drug reaction (ADR) recognition, it is 61.1 by the F1-exact metric, which, as our analysis shows, is on par with the accuracy level for corpora in other languages with similar characteristics and ADR representativeness. The evaluated baseline precision of coreference relation extraction on the corpus is 71, which is higher than the results reached on other Russian corpora.
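
As an illustration of the kind of metric reported above, the following is a minimal sketch of an exact-match F1 score for entity mentions. It assumes, hypothetically, that each mention is represented as a `(start, end, label)` tuple and that a prediction counts as correct only when all three fields coincide with a gold mention; the function name `f1_exact` and this representation are illustrative assumptions, not the evaluation code used for the corpus.

```python
def f1_exact(gold, pred):
    """Exact-match F1 over entity mentions.

    Each mention is assumed to be a (start, end, label) tuple; a predicted
    mention is a true positive only if an identical tuple exists in gold.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the second prediction has a shifted right boundary,
# so it is not counted under exact matching.
gold = [(0, 5, "Medication"), (10, 20, "ADR")]
pred = [(0, 5, "Medication"), (10, 18, "ADR")]
score = f1_exact(gold, pred)
```

Under this strict criterion, a mention with a partially overlapping boundary contributes both a false positive and a false negative, which is one reason exact-match scores for multi-word ADR mentions tend to be lower than relaxed-match scores.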
