Improving covariance-regularized discriminant analysis for EHR-based predictive analytics of diseases

Linear Discriminant Analysis (LDA) is a well-known technique for feature extraction and dimension reduction. The performance of classical LDA, however, degrades significantly on High Dimension Low Sample Size (HDLSS) data because of the ill-posed inverse problem. Existing approaches to HDLSS classification typically assume that the data follow a Gaussian distribution and address the problem through regularization. However, these assumptions are too strict to hold in many emerging real-life applications, such as personalized predictive analysis using Electronic Health Records (EHR) data collected from an extremely limited number of patients who have been diagnosed with or without the target disease. In this paper, we revisit the problem of predictive analysis of disease using personal EHR data and an LDA classifier. To fill the gap, we first study an analytical model that characterizes the accuracy of LDA for classifying data with an arbitrary distribution. The model gives a theoretical upper bound on the LDA error rate that is controlled by two factors: (1) the statistical convergence rate of the (inverse) covariance matrix estimator and (2) the divergence of the training/testing datasets from the fitted distributions. The error rate can therefore be lowered by balancing these two factors. Building on this, we propose a novel LDA classifier, De-Sparse, that leverages the De-sparsified Graphical Lasso to improve the estimation of LDA and that outperforms state-of-the-art LDA approaches developed for HDLSS data. These advantages are demonstrated by both theoretical analysis and extensive experiments on EHR datasets.
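
For orientation, a de-sparsified graphical lasso estimate can be plugged into a two-class LDA rule roughly as sketched below. The abstract gives no formulas, so the notation is assumed here: \hat\Sigma is the pooled sample covariance, \lambda the lasso penalty, \hat\mu_0 and \hat\mu_1 the class means, and \hat\pi_0 and \hat\pi_1 the class priors. The de-sparsifying step is the standard one from the literature on confidence intervals for inverse covariance estimation; the exact construction used by De-Sparse may differ.

```latex
% Sketch under assumed notation; not taken verbatim from the paper.
\begin{align}
  % Graphical lasso: sparse estimate of the inverse covariance (precision) matrix
  \hat\Theta &= \arg\min_{\Theta \succ 0}\;
      \operatorname{tr}(\hat\Sigma\,\Theta) - \log\det\Theta + \lambda\,\|\Theta\|_1 \\
  % De-sparsifying step: corrects the bias introduced by the \ell_1 penalty
  \hat T &= 2\hat\Theta - \hat\Theta\,\hat\Sigma\,\hat\Theta \\
  % Two-class LDA rule with \hat T as the precision matrix
  \delta(x) &= \Bigl(x - \tfrac{\hat\mu_0 + \hat\mu_1}{2}\Bigr)^{\!\top}
      \hat T\,(\hat\mu_1 - \hat\mu_0) + \log\frac{\hat\pi_1}{\hat\pi_0}
\end{align}
```

Under this sketch a sample x is assigned to class 1 when \delta(x) > 0 and to class 0 otherwise.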
