ALBERT-Based Self-Ensemble Model With Semisupervised Learning and Data Augmentation for Clinical Semantic Textual Similarity Calculation: Algorithm Validation Study

BACKGROUND In recent years, with the growth in the amount of available information and the increasing importance of information screening, more attention has been paid to the calculation of textual semantic similarity. In the field of medicine, electronic medical records and medical research documents have become important data resources for clinical research, and medical textual semantic similarity calculation has become an urgent problem to be solved.

OBJECTIVE This research aims to solve 2 problems: (1) when medical data sets are small, models cannot learn and understand the data sufficiently, and (2) when information is lost during long-distance propagation, models cannot grasp key information.

METHODS This paper combines a text data augmentation method with a self-ensemble ALBERT model under semisupervised learning to perform clinical textual semantic similarity calculation.

RESULTS Compared with the methods submitted to the 2019 National NLP Clinical Challenges (n2c2)/Open Health Natural Language Processing (OHNLP) shared task track on clinical semantic textual similarity, our method surpasses the best result by 2 percentage points and achieves a Pearson correlation coefficient of 0.92.

CONCLUSIONS When a medical data set is small, data augmentation can increase its size, and improved semisupervised learning can boost the learning efficiency of the model. In addition, self-ensemble methods improve model performance. Our method performed excellently and has great potential for application to related medical problems.
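The abstract does not include the authors' implementation. As an illustration only, the following minimal Python sketch shows how the self-ensemble step described above could be realized with the Hugging Face transformers library: several ALBERT checkpoints saved during fine-tuning (the checkpoint paths and sentence pairs below are hypothetical) each score the same sentence pairs, their predictions are averaged, and the ensemble output is evaluated with the Pearson correlation coefficient used in the n2c2/OHNLP track.

```python
# Illustrative sketch of a checkpoint self-ensemble for clinical STS.
# Not the authors' code; checkpoint paths and example data are hypothetical.

import torch
from scipy.stats import pearsonr
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoints saved at different points of ALBERT fine-tuning.
CHECKPOINTS = ["albert-sts/checkpoint-1", "albert-sts/checkpoint-2", "albert-sts/checkpoint-3"]
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

def score_pairs(model_path, sentence_pairs):
    """Predict a similarity score for each sentence pair with a regression head (num_labels=1)."""
    model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=1)
    model.eval()
    scores = []
    with torch.no_grad():
        for s1, s2 in sentence_pairs:
            inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True)
            scores.append(model(**inputs).logits.squeeze().item())
    return scores

def self_ensemble(sentence_pairs):
    """Average the predictions of all saved checkpoints (the self-ensemble step)."""
    all_scores = [score_pairs(path, sentence_pairs) for path in CHECKPOINTS]
    return [sum(col) / len(col) for col in zip(*all_scores)]

# Hypothetical evaluation against gold similarity labels (clinical STS uses a 0-5 scale).
pairs = [("The patient denies chest pain.", "No chest pain reported."),
         ("Continue metformin 500 mg daily.", "Metformin 500 mg was continued."),
         ("Continue metformin 500 mg daily.", "Patient was discharged home.")]
gold = [4.5, 5.0, 0.5]
preds = self_ensemble(pairs)
print("Pearson r:", pearsonr(preds, gold)[0])
```

The averaging over checkpoints is the key design choice: it smooths out the variance of individual fine-tuning snapshots without training separate models, which matters when the labeled clinical data set is small.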
