Transferring Knowledge from Text to Predict Disease Onset

In many domains, such as medicine, training data is in short supply. In such cases, external knowledge is often helpful in building predictive models. We propose a novel method that incorporates publicly available domain expertise to build accurate models. Specifically, we use word2vec models trained on a domain-specific corpus to estimate how relevant each feature's text description is to the prediction problem. We then rescale each feature by its relevance estimate, so that features estimated to be more relevant are regularized less strongly. We apply our method to predict the onset of five chronic diseases within the next five years, for two genders and two age groups. Our rescaling approach improves the accuracy of the model, particularly when there are few positive examples. Furthermore, our method selects 60% fewer features, which eases interpretation by physicians. Our method is applicable to other domains in which text descriptions of the features and the outcome are available.
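The rescaling idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy word vectors stand in for a word2vec model trained on a medical corpus, and the helper names (`mean_vector`, `relevance`, `rescale`), the averaging of word vectors, the use of cosine similarity, and the `floor` parameter are hypothetical choices, not the procedure used in the paper. The one property it relies on is standard: under an L1 or L2 penalty, multiplying feature j by a factor r_j before fitting is equivalent to dividing the penalty on its original-scale coefficient by r_j, so a larger r_j means weaker regularization.

```python
import math

# Toy vectors with made-up values, standing in for a trained word2vec model.
TOY_VECTORS = {
    "diabetes": [0.90, 0.10, 0.00],
    "glucose":  [0.80, 0.20, 0.10],
    "insulin":  [0.85, 0.15, 0.05],
    "fracture": [0.00, 0.90, 0.30],
    "ankle":    [0.10, 0.80, 0.40],
}
DIM = 3


def mean_vector(text):
    """Average the vectors of the known words in a text description."""
    words = [w for w in text.lower().split() if w in TOY_VECTORS]
    if not words:
        return [0.0] * DIM
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(DIM)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def relevance(feature_description, outcome_description, floor=0.05):
    """Hypothetical relevance score: similarity clipped below at `floor`
    so that no feature is scaled to zero and removed outright."""
    sim = cosine(mean_vector(feature_description), mean_vector(outcome_description))
    return max(sim, floor)


def rescale(rows, relevances):
    """Multiply each feature column by its relevance estimate."""
    return [[x * r for x, r in zip(row, relevances)] for row in rows]


descriptions = ["glucose insulin", "ankle fracture"]
scores = [relevance(d, "diabetes") for d in descriptions]
scaled = rescale([[1.0, 1.0], [0.0, 2.0]], scores)
```

The rescaled matrix would then be passed to any regularized linear model. With these toy vectors, the "glucose insulin" feature receives a higher score than "ankle fracture" for the outcome "diabetes", so its coefficient is penalized less.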
