Cancer Research Trend Analysis Based on Fusion Feature Representation

Machine learning models can automatically discover biomedical research trends and thereby promote the dissemination of information and knowledge. Text feature representation is a critical and challenging task in natural language processing, and most methods for it are built on word representations. A good representation captures both semantic and structural information. In this paper, two fusion algorithms are proposed: Tr-W2v and Ti-W2v. Both build on the classical text feature representation model and additionally take the importance of each word into account. The results show that both fusion text representation models are more effective than the classical text representation model, with the Tr-W2v algorithm giving the best results. Based on the Tr-W2v algorithm, trend analyses of cancer research are then conducted, including correlation analysis, keyword trend analysis, and improved keyword trend analysis. Discovering research trends and the evolution of hotspots for cancers can help doctors and biological researchers collect information and can provide guidance for further research.
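The idea of fusing word importance with word vectors can be illustrated with a minimal sketch. It assumes, without confirmation from the abstract, that Ti-W2v weights word2vec vectors by TF-IDF scores; the function names, the smoothed IDF formula, and the two-dimensional toy embeddings below are all hypothetical stand-ins, not the paper's implementation. In practice the embeddings would come from a trained word2vec model.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document.

    Uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, chosen here for
    illustration only; the paper's exact weighting is not given.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in counts.items()
        })
    return weights


def fused_doc_vector(doc_weights, embeddings, dim):
    """Importance-weighted average of the word vectors of one document."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in doc_weights.items():
        vec = embeddings.get(word)
        if vec is None:  # skip words without an embedding
            continue
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Toy corpus and hand-made 2-d "embeddings" (assumed values, not trained).
docs = [["tumor", "gene", "mutation"], ["tumor", "therapy", "trial"]]
embeddings = {
    "tumor": [1.0, 0.0],
    "gene": [0.0, 1.0],
    "mutation": [0.0, 1.0],
    "therapy": [1.0, 1.0],
    "trial": [1.0, 1.0],
}

weights = tfidf_weights(docs)
doc_vectors = [fused_doc_vector(w, embeddings, dim=2) for w in weights]
```

Because "tumor" appears in both toy documents, it receives a lower weight than the document-specific words, so each fused vector leans toward the terms that distinguish its document. A TextRank-style importance score could be substituted for the TF-IDF weights in the same averaging step.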
