Discovering Key Topics From Short, Real-World Medical Inquiries via Natural Language Processing

Pharmaceutical companies receive millions of unsolicited medical inquiries every year. It has been hypothesized that these inquiries represent a treasure trove of information, potentially giving insight into medicinal products and the associated medical treatments. However, due to the large volume and specialized nature of the inquiries, it is difficult to perform timely, recurrent, and comprehensive analyses. Here, we propose a machine learning approach based on natural language processing and unsupervised learning to automatically discover key topics in real-world medical inquiries from customers. This approach requires neither ontologies nor annotations. The discovered topics are meaningful and medically relevant, as judged by medical information specialists, thus demonstrating that unsolicited medical inquiries are a source of valuable customer insights. Our work paves the way for the machine-learning-driven analysis of medical inquiries in the pharmaceutical industry, which ultimately aims at improving patient care.
