A method of inferring the relationship between biomedical entities through correlation analysis on text

Background: Representing words is one of the most important steps in machine learning-based natural language processing. The commonly used one-hot representation produces very large vectors and assumes that the features making up the vector are independent of each other. Word embeddings, by contrast, capture the meaning of words well and are known to be highly effective for estimating the similarity between words. In this study, we use this ability to clarify the correlations between terms in biomedical texts, applying word embedding to find new biomarkers and microorganisms related to specific diseases.

Methods: We analyze the correlation between disease-marker pairs and disease-microorganism pairs. First, we construct a corpus likely to be relevant to these entities by extracting titles and abstracts from biomedical texts on PubMed. Second, we represent disease, marker, and microorganism terms as word embeddings using Canonical Correlation Analysis (CCA), a statistical method that performs very well at vector dimensionality reduction. Finally, we estimate the relationship within each disease-marker pair and each disease-microorganism pair by measuring the similarity of their embeddings.

Results: We checked the correlations derived from the word embeddings against Google Scholar search results. Of the 20 most highly correlated disease-marker pairs, about 85% turned out to have been studied extensively according to Google Scholar searches. Conversely, for 85% of the 20 pairs with the lowest correlation, we could not find any study addressing the relationship between the disease and the marker. The trend was similar for disease-microorganism pairs.

Conclusions: The correlations between diseases and markers, and between diseases and microorganisms, calculated through word embedding reflect actual research trends. When a pair has a high word-embedding correlation but few published studies, it can be proposed as a candidate for further research.
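The embedding and similarity steps described in the Methods can be illustrated with a minimal sketch. It is not the implementation used in the study. It assumes a common count-based formulation of CCA word embeddings, in which the uncentered CCA between one-hot word and context indicators reduces to a singular value decomposition of the scaled co-occurrence matrix, and it assumes cosine similarity as the similarity measure, which the abstract does not specify. The function names, the window size, and the toy sentences are all hypothetical.

```python
import numpy as np


def cca_embeddings(sentences, dim=2, window=2):
    """Count-based CCA-style embeddings (assumed formulation, not the paper's code)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    # Word-by-context co-occurrence counts within a symmetric window.
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1.0

    # Scale to D_w^{-1/2} C D_c^{-1/2}; guard against empty rows or columns.
    row = counts.sum(axis=1)
    col = counts.sum(axis=0)
    row[row == 0] = 1.0
    col[col == 0] = 1.0
    scaled = counts / np.sqrt(np.outer(row, col))

    # The top left singular vectors give the reduced word representations.
    u, _, _ = np.linalg.svd(scaled, full_matrices=False)
    vectors = u[:, :dim]
    return {w: vectors[index[w]] for w in vocab}


def cosine(a, b):
    """Cosine similarity (assumed measure); returns 0.0 for zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def rank_pairs(emb, diseases, candidates):
    """Score every disease-candidate pair and sort from most to least similar."""
    pairs = [
        (d, c, cosine(emb[d], emb[c]))
        for d in diseases
        for c in candidates
        if d in emb and c in emb
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Invented toy sentences standing in for tokenized PubMed titles and abstracts.
toy_corpus = [
    ["ca125", "elevated", "ovarian_cancer"],
    ["he4", "elevated", "ovarian_cancer"],
    ["helicobacter_pylori", "infection", "gastric_cancer"],
    ["helicobacter_pylori", "associated", "gastric_cancer"],
]
embeddings = cca_embeddings(toy_corpus, dim=2)
for disease, term, score in rank_pairs(
    embeddings, ["ovarian_cancer"], ["ca125", "helicobacter_pylori"]
):
    print(f"{disease}\t{term}\t{score:.3f}")
```

On a real corpus the vocabulary would be far larger, so a sparse matrix and a truncated SVD would be the more practical choice. The top and bottom of the ranked list correspond to the 20 highest and 20 lowest correlated pairs examined in the Results.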
