TermInformer: unsupervised term mining and analysis in biomedical literature

Terminology is the most basic information that researchers and literature analysis systems need to understand. Mining terms and revealing the semantic relationships between them can help biomedical researchers find solutions to major health problems and motivate them to explore innovative research questions. However, mining terms from biomedical literature remains a challenge. Research on text segmentation in natural language processing (NLP) has not yet been applied well in the biomedical field. Named entity recognition models usually require a large training corpus, and the entity types they can recognize are limited. Dictionary-based methods match text against pre-established vocabularies, but they can only match terms in a specific field, and collecting those terms is time-consuming and labour-intensive. Many scenarios in biomedical research are unsupervised, i.e. the corpora are unlabelled and the system may have little prior knowledge. This paper proposes the TermInformer project, which aims to mine the meaning of terms in an open fashion by computing over terms, and thereby to help find solutions to some of the significant problems in our society. We propose an unsupervised method that automatically mines terms from text without relying on external resources. The method is general and can be applied to any document collection. Combined with a word vector training algorithm, it yields reusable term embeddings that can be used in any downstream NLP application. This paper compares the term embeddings with existing word embeddings. The results show that our method better reflects the semantic relationships between terms. Finally, we use the proposed method to find potential factors and treatments for lung cancer, breast cancer, and coronavirus.
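
The abstract does not say how terms are mined, so the following is only a minimal sketch of one common unsupervised approach, not the TermInformer method itself. It scores adjacent word pairs by pointwise mutual information (PMI) and joins high-scoring pairs into single tokens, which could then be passed to any word vector training algorithm to obtain term embeddings. The function names `mine_bigram_terms` and `merge_terms`, the parameters `min_count` and `threshold`, and the toy corpus are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def mine_bigram_terms(sentences, min_count=2, threshold=1.0):
    """Return {(w1, w2): pmi} for adjacent pairs with PMI above threshold.

    Assumption: PMI over adjacent words stands in for the paper's
    unspecified term-mining step. No labels or dictionaries are used.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    terms = {}
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = n / total_bi
        p1 = unigrams[w1] / total_uni
        p2 = unigrams[w2] / total_uni
        pmi = math.log(p_pair / (p1 * p2))
        if pmi > threshold:
            terms[(w1, w2)] = pmi
    return terms


def merge_terms(tokens, terms):
    """Greedily join mined pairs, left to right, into single tokens."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in terms:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


# Toy, hand-written corpus (hypothetical; not data from the paper).
corpus = [
    "lung cancer is linked to smoking".split(),
    "smoking raises the risk of lung cancer".split(),
    "breast cancer screening reduces the risk".split(),
    "the lung cancer cohort is large".split(),
]

terms = mine_bigram_terms(corpus)
print(merge_terms("lung cancer is common".split(), terms))
```

On a corpus this small the score also promotes frequent function-word pairs, so a realistic pipeline would need more data and probably stop-word handling; the sketch only shows the shape of the idea.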
