Searching for Evidence of Scientific News in Scholarly Big Data

Public digital media often mix factual information with fake scientific news, which is typically difficult to pinpoint, especially for non-professionals. Such news articles create illusions and misconceptions and thus ultimately influence public opinion, with serious consequences at a broader social scale. Yet existing solutions for automatically verifying the credibility of news articles are still unsatisfactory. We propose to verify scientific news by retrieving and analyzing its most relevant source papers from an academic digital library (DL), e.g., arXiv. Instead of querying keywords or regular named entities extracted from news articles, we query domain knowledge entities (DKEs) extracted from the text. By querying each DKE, we retrieve a list of candidate scholarly papers. We then design a function to rank them and select the most relevant scholarly paper. Experiments with various representations indicate that the term frequency-inverse document frequency (TF-IDF) representation with cosine similarity outperforms baseline models based on word embeddings. This result demonstrates the efficacy of using DKEs to retrieve scientific papers that are relevant to a specific news article. It also indicates that word embeddings may not be the best document representation for domain-specific document retrieval tasks. Our method is fully automated and can be effectively applied to facilitate the detection of fake and misinformed news across many scientific domains.
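
The ranking step described above can be illustrated with a minimal sketch, assuming the candidate papers have already been retrieved by querying DKEs. The abstract does not specify the ranking function, so the smoothed IDF weighting, the tokenizer, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `rank_candidates` are all assumptions made for illustration, as are the example texts and paper identifiers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption; other variants exist).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: f * idf[t] for t, f in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(news_text, candidates):
    """Rank (paper_id, paper_text) pairs by TF-IDF cosine similarity to the news text."""
    texts = [news_text] + [text for _, text in candidates]
    vectors = tfidf_vectors(texts)
    news_vec = vectors[0]
    scored = [
        (paper_id, cosine(news_vec, vec))
        for (paper_id, _), vec in zip(candidates, vectors[1:])
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical inputs: a news snippet and two candidate paper titles.
news = "Researchers find graphene superconductivity at a magic twist angle"
candidates = [
    ("paper-b", "Deep convolutional networks for image classification"),
    ("paper-a", "Unconventional superconductivity in magic angle twisted bilayer graphene"),
]
ranking = rank_candidates(news, candidates)
best_paper_id = ranking[0][0]
```

In this sketch the IDF statistics are computed only over the news article and its candidates; a real system would more plausibly estimate them from the whole digital library, which is a design choice the abstract leaves open.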
