Diagnosis Recommendation Using Machine Learning Scientific Workflows

Diagnosis recommendation plays a significant role in healthcare: a clinician must infer the most appropriate diagnosis for a patient, and the quality of that inference directly affects the patient's quality of life. Existing machine learning techniques for this problem require many labeled instances, which are not readily available. To overcome this limitation, we present a scientific workflow that represents a semi-supervised, clustering-based diagnosis recommendation model. In this approach, initial clusters are formed from a labeled dataset; then, by imposing a relative threshold on each cluster, frequent patterns and their corresponding labels are obtained. Subsequently, unlabeled instances are labeled by assigning them to the most similar clusters. Finally, we form clusters on the newly generated dataset and recommend a diagnosis label by applying a minimum threshold. To evaluate our model, we perform extensive experiments on the i2b2 datasets and compare our proposed algorithm with the self-training and co-training methods. The experimental results show that our proposed algorithm outperforms these methods in most cases. The proposed workflow is implemented in the DATAVIEW system.
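The four steps above can be illustrated with a minimal sketch. It is not the implementation used in DATAVIEW, and several details are assumptions made only for illustration: instances are treated as sets of clinical terms, initial clusters are formed by grouping labeled instances by label, a "frequent pattern" is the set of terms present in at least a `rel_threshold` fraction of a cluster, and similarity is Jaccard overlap. All function names, parameter names, and default values are hypothetical.

```python
from collections import Counter, defaultdict


def frequent_items(instances, rel_threshold):
    """Terms present in at least rel_threshold of a cluster's instances."""
    counts = Counter(item for inst in instances for item in inst)
    cutoff = rel_threshold * len(instances)
    return {item for item, count in counts.items() if count >= cutoff}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_patterns(labeled, rel_threshold):
    """Steps 1-2: group instances by label, keep each cluster's frequent terms."""
    clusters = defaultdict(list)
    for items, label in labeled:
        clusters[label].append(set(items))
    return {
        label: frequent_items(insts, rel_threshold)
        for label, insts in clusters.items()
    }


def most_similar(items, patterns):
    """Return (score, label) of the cluster pattern closest to the instance."""
    return max((jaccard(set(items), p), label) for label, p in patterns.items())


def recommend(labeled, unlabeled, query, rel_threshold=0.5, min_threshold=0.3):
    """Recommend a label for query, or None if no cluster is similar enough."""
    patterns = build_patterns(labeled, rel_threshold)
    # Step 3: pseudo-label each unlabeled instance with its closest cluster.
    pseudo = [(items, most_similar(items, patterns)[1]) for items in unlabeled]
    # Step 4: rebuild clusters on the enlarged dataset, then apply the
    # minimum similarity threshold before recommending a label.
    patterns = build_patterns(labeled + pseudo, rel_threshold)
    score, label = most_similar(query, patterns)
    return label if score >= min_threshold else None


# Toy data, invented purely for illustration.
labeled = [
    ({"chest pain", "high bp"}, "hypertension"),
    ({"high bp", "headache"}, "hypertension"),
    ({"high bmi", "fatigue"}, "obesity"),
    ({"high bmi", "joint pain"}, "obesity"),
]
unlabeled = [{"high bp", "dizziness"}, {"high bmi", "snoring"}]

print(recommend(labeled, unlabeled, {"high bp", "headache"}))
print(recommend(labeled, unlabeled, {"cough"}))
```

In this toy run the pseudo-labeled instances tighten each cluster's pattern to its one consistently present term, so a query sharing that term clears the minimum threshold while an unrelated query yields no recommendation.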
