Paper information - Extraction of disease-related genes from PubMed paper using word2vec

Extraction of disease-related genes from PubMed paper using word2vec

Finding disease-related genes is important in drug discovery. Many genes may be involved in a single disease, and many studies have been conducted and reported for each disease. However, checking these reports one by one is very costly. Machine learning is therefore a suitable way to address this problem: by extracting study results from research papers through text mining, that knowledge can be put to use. In this research, we aim to extract disease-related genes from PubMed papers using word2vec, a text mining method. The method extracts the top 10 genes whose word2vec vectors are closest to the vectors of known disease-related genes. From these, genes that are not already known to be disease-related are extracted and treated as candidate disease-related genes. We conducted experiments on schizophrenia and assessed the plausibility of the extracted genes using XGBoost with three gene sets. Pattern 1: only known genes. Pattern 2: Pattern 1 plus the disease-related genes extracted in this study. Pattern 3: Pattern 1 plus the same number of random genes. Using these three patterns, we trained XGBoost classifiers on microarray data and compared their classification accuracy. Pattern 2 had the highest accuracy. Therefore, our method can extract disease-related genes by starting from known disease-related genes.
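The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each gene already has an embedding vector (in practice these would come from a word2vec model trained on PubMed text, which is not shown here), that closeness is measured by cosine similarity, and that the top-k neighbours are taken per known gene. The gene names, the toy vectors, and the function name `candidate_genes` are hypothetical placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_genes(vectors, known_genes, top_k=10):
    """Collect genes that appear among the top_k nearest neighbours of any
    known disease gene, excluding the known genes themselves.

    Assumption: neighbours are ranked per known gene by cosine similarity.
    """
    best_score = {}
    for known in known_genes:
        if known not in vectors:
            continue  # a known gene may be missing from the vocabulary
        scored = [
            (cosine(vectors[known], vec), gene)
            for gene, vec in vectors.items()
            if gene != known
        ]
        scored.sort(reverse=True)
        for score, gene in scored[:top_k]:
            if gene not in known_genes:
                best_score[gene] = max(score, best_score.get(gene, -1.0))
    # Most similar candidates first
    return sorted(best_score, key=lambda g: best_score[g], reverse=True)


# Toy 3-dimensional vectors, made up purely for illustration.
toy_vectors = {
    "KNOWN_1": [1.0, 0.0, 0.0],
    "KNOWN_2": [0.9, 0.1, 0.0],
    "GENE_A": [0.8, 0.2, 0.0],
    "GENE_B": [0.0, 1.0, 0.0],
    "GENE_C": [0.0, 0.0, 1.0],
}
known = {"KNOWN_1", "KNOWN_2"}

result = candidate_genes(toy_vectors, known, top_k=2)
print(result)
```

With `top_k=2`, only `GENE_A` lies close enough to the known genes to be returned. In the evaluation described above, the genes found this way would be added to the known genes to form the Pattern 2 feature set.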

Hayato Ohwada | Takahiro Koiwa
