Semi-supervised learning of the electronic health record for phenotype stratification

Patient interactions with health care providers result in entries to electronic health records (EHRs). EHRs were built for clinical and billing purposes, but they contain many data points about each individual. Mining these records provides an opportunity to extract electronic phenotypes, which can be paired with genetic data to identify genes underlying common human diseases. This task remains challenging for three reasons: high-quality phenotyping is costly and requires physician review; many fields in the records are sparsely filled; and disease definitions continue to be refined over time. Here we develop and evaluate a semi-supervised learning method for EHR phenotype extraction that uses denoising autoencoders for phenotype stratification. By combining denoising autoencoders with random forests, we find classification improvements across multiple simulation models and improved survival prediction in ALS clinical trial data. The improvement is particularly evident when only a small number of patients have high-quality phenotypes, a common scenario in EHR-based research. Denoising autoencoders also perform dimensionality reduction, which enables visualization and clustering for the discovery of new disease subtypes. This method represents a promising approach to clarify disease subtypes and improve genotype-phenotype association studies that leverage EHRs.
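
The semi-supervised idea described above can be illustrated with a minimal sketch: a denoising autoencoder is trained on all patient records without labels, and a random forest is then trained on the learned hidden representation using only the small labeled subset. This is not the authors' implementation. It assumes scikit-learn's MLPRegressor as a stand-in for the denoising autoencoder, masking noise as the corruption process, and fully synthetic data; the feature counts, subtype structure, corruption rate, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for EHR features (hypothetical): two hidden patient
# subtypes, each raising the frequency of a different block of binary fields.
n_patients, n_features, n_labeled = 1000, 40, 50
subtype = rng.integers(0, 2, size=n_patients)
prob = np.full((n_patients, n_features), 0.1)
prob[subtype == 0, :10] = 0.6
prob[subtype == 1, 10:20] = 0.6
X = (rng.random((n_patients, n_features)) < prob).astype(float)

# Masking noise (assumed corruption scheme): zero out a random 20% of inputs.
corruption = 0.2
X_noisy = X * (rng.random(X.shape) >= corruption)

# Unsupervised step: learn to reconstruct the clean record from the corrupted
# one. Labels are not used here, so every patient contributes.
dae = MLPRegressor(hidden_layer_sizes=(8,), activation="logistic",
                   max_iter=500, random_state=0)
dae.fit(X_noisy, X)


def encode(model, data):
    """Return hidden-layer activations of the fitted one-hidden-layer network."""
    z = data @ model.coefs_[0] + model.intercepts_[0]
    return 1.0 / (1.0 + np.exp(-z))


H = encode(dae, X)  # reduced representation, one row per patient

# Supervised step: only a small subset of patients has a reviewed phenotype.
labeled = rng.choice(n_patients, size=n_labeled, replace=False)
held_out = np.ones(n_patients, dtype=bool)
held_out[labeled] = False

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(H[labeled], subtype[labeled])
accuracy = clf.score(H[held_out], subtype[held_out])
print(f"held-out accuracy with {n_labeled} labeled patients: {accuracy:.3f}")
```

The same hidden representation `H` could also be passed to a clustering or embedding method for visualization, which corresponds to the subtype-discovery use mentioned in the abstract.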
