A Cross-lingual Natural Language Processing Framework for Infodemic Management

The COVID-19 pandemic has put immense pressure on health systems which are further strained due to the misinformation surrounding it. Under such a situation, providing the right information at the right time is crucial. There is a growing demand for the management of information spread using Artificial Intelligence. Hence, we have exploited the potential of Natural Language Processing for identifying relevant information that needs to be disseminated amongst the masses. In this work, we present a novel Cross-lingual Natural Language Processing framework to provide relevant information by matching daily news with trusted guidelines from the World Health Organization. The proposed pipeline deploys various techniques of NLP such as summarizers, word embeddings, and similarity metrics to provide users with news articles along with a corresponding healthcare guideline. A total of 36 models were evaluated and a combination of LexRank based summarizer on Word2Vec embedding with Word Mover distance metric outperformed all other models. This novel open-source approach can be used as a template for proactive dissemination of relevant healthcare information in the midst of misinformation spread associated with epidemics.

[1]  Luis Argerich,et al.  Variations of the Similarity Function of TextRank for Automated Summarization , 2016, ArXiv.

[2]  Shaghayegh Haghjooy Javanmard,et al.  Infodemic and Risk Communication in the Era of CoV-19 , 2020, Advanced biomedical research.

[3]  Matt J. Kusner,et al.  From Word Embeddings To Document Distances , 2015, ICML.

[4]  Emma Stoye China coronavirus: how many papers have been published? , 2020, Nature.

[5]  Hans Peter Luhn,et al.  The Automatic Creation of Literature Abstracts , 1958, IBM J. Res. Dev..

[6]  Hongyan Jing,et al.  Sentence Reduction for Automatic Text Summarization , 2000, ANLP.

[7]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[8]  W. Youden,et al.  Index for rating diagnostic tests , 1950, Cancer.

[9]  Sanjeev Arora,et al.  A Simple but Tough-to-Beat Baseline for Sentence Embeddings , 2017, ICLR.

[10]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[11]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[12]  Adhistya Erna Permanasari,et al.  Cosine similarity to determine similarity measure: Study case in online essay assessment , 2016, 2016 4th International Conference on Cyber and IT Service Management.

[13]  Ferda Nur Alpaslan,et al.  Text summarization using Latent Semantic Analysis , 2011, J. Inf. Sci..

[14]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[15]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[16]  Jagadeesh Jagarlamudi,et al.  Summarization Approaches Based on Document Probability Distributions , 2009, PACLIC.