COVID-19 Multidimensional Kaggle Literature Organization

The unprecedented outbreak of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), the virus that causes COVID-19, continues to be a significant worldwide problem. It has been accompanied by a surge of new COVID-19 related research, and the growing number of publications requires document organization methods to identify relevant information. In this paper, we expand upon our previous work on clustering the CORD-19 dataset by applying multidimensional analysis methods. Tensor factorization is a powerful unsupervised learning method capable of discovering hidden patterns in a document corpus. We show that a higher-order representation of the corpus allows for the simultaneous grouping of similar articles, relevant journals, authors with similar research interests, and topic keywords. These groupings are identified within and among the latent components extracted via tensor decomposition. We further demonstrate the application of this method with a publicly available interactive visualization of the dataset.
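As a rough illustration of how a tensor decomposition can group several entity types at once, the following is a minimal sketch of a rank-2 CP decomposition fitted by alternating least squares on a toy documents x words x authors tensor. The toy data, the tensor modes, the rank, and the helper names (`khatri_rao`, `unfold`, `cp_als`) are assumptions made for this example; the paper's actual tensor construction, factorization variant, and software are not specified in the abstract and may differ.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R), giving (I*J x R)."""
    return np.einsum("ir,jr->ijr", a, b).reshape(-1, a.shape[1])

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest in C order."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def cp_als(tensor, rank, n_iter=200, seed=0):
    """Plain (unconstrained) CP-ALS for a 3-way tensor; returns one factor per mode."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in tensor.shape]
    for _ in range(n_iter):
        for mode in range(3):
            first, second = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(first, second)
            # Gram matrix of the Khatri-Rao product, computed cheaply
            gram = (first.T @ first) * (second.T @ second)
            factors[mode] = unfold(tensor, mode) @ kr @ np.linalg.pinv(gram)
    return factors

# Hypothetical toy corpus: 6 documents, 5 words, 4 authors, 2 planted topics.
docs_true = np.array([[1, 0], [2, 0], [1, 0], [0, 1], [0, 3], [0, 1]], dtype=float)
words_true = np.array([[1, 0], [1, 0], [2, 0], [0, 1], [0, 2]], dtype=float)
authors_true = np.array([[1, 0], [1, 0], [0, 1], [0, 2]], dtype=float)
tensor = np.einsum("ir,jr,kr->ijk", docs_true, words_true, authors_true)

docs, words, authors = cp_als(tensor, rank=2)

# Relative reconstruction error of the fitted model
approx = np.einsum("ir,jr,kr->ijk", docs, words, authors)
rel_err = np.linalg.norm(tensor - approx) / np.linalg.norm(tensor)

# Assign each entity to the latent component where it has the largest weight
doc_groups = np.abs(docs).argmax(axis=1)
word_groups = np.abs(words).argmax(axis=1)
author_groups = np.abs(authors).argmax(axis=1)
print(rel_err, doc_groups, word_groups, author_groups)
```

Each latent component carries one column in every factor matrix, so reading the largest entries of a component across the document, word, and author factors yields a group of articles together with its keywords and associated authors. A corpus-scale analysis would more likely use sparse tensors and a nonnegativity constraint, which this sketch omits.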
