Temporal Positive-unlabeled Learning for Biomedical Hypothesis Generation via Risk Estimation

Understanding the relationships between biomedical terms such as viruses, drugs, and symptoms is essential in the fight against diseases. Many attempts have been made to bring machine learning into the scientific process of hypothesis generation (HG), which refers to the discovery of meaningful implicit connections between biomedical terms. However, most existing methods fail to capture the temporal dynamics of scientific term relations, and they also assume that unobserved connections are irrelevant (i.e., they adopt a positive-negative (PN) learning setting). To overcome these limitations, we formulate the HG problem as a future connectivity prediction task on a dynamic attributed graph via positive-unlabeled (PU) learning. The key is then to capture the temporal evolution of node pair (term pair) relations from only the positive and unlabeled data. We propose a variational inference model to estimate the positive prior and incorporate it into the learning of node pair embeddings, which are then used for link prediction. Experimental results on real-world biomedical term relationship datasets and case study analyses on a COVID-19 dataset validate the effectiveness of the proposed model.
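To make the role of the positive prior concrete, the following is a minimal sketch of one widely used form of PU risk estimation, the non-negative risk estimator with a sigmoid surrogate loss. It is an illustration under stated assumptions, not necessarily the formulation used in this paper: the function names (`sigmoid_loss`, `nn_pu_risk`), the choice of loss, and the assumption that the prior is supplied as a fixed number (rather than inferred by a variational model) are all hypothetical choices made for this example.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss; small when the sign of score agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk from scores of positive and unlabeled node pairs.

    pos_scores: classifier scores for observed (positive) term pairs.
    unl_scores: classifier scores for unobserved (unlabeled) term pairs.
    prior:      assumed fraction of true positives among all pairs (hypothetical input).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")

    # Risk of positives when treated as positives.
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    # Risk of positives when treated as negatives (used for the correction term).
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    # Risk of unlabeled data when treated as negatives.
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])

    # Unlabeled data mixes positives and negatives, so subtract the positive share.
    negative_part = risk_unl_as_neg - prior * risk_pos_as_neg

    # Clamp at zero so the estimated negative-class risk never becomes negative.
    return prior * risk_pos_as_pos + max(0.0, negative_part)


if __name__ == "__main__":
    # Toy scores: higher means "more likely to become connected".
    print(nn_pu_risk([2.0, 3.0], [-2.0, -3.0, 2.5], prior=0.3))
```

In this sketch the prior directly reweights the positive and unlabeled terms, which is why a poor prior estimate can bias the learned embeddings and why estimating it carefully matters.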
