Disease-Specific Risk Prediction through Stability Selection using Electronic Health Records

Disease-specific risk prediction aims to assess a patient's risk of developing a target disease based on their health profile. As electronic health records (EHRs) become more prevalent, a large number of features can be constructed to characterize patient profiles. This wealth of data provides unprecedented opportunities for data mining researchers to address important biomedical questions. Practical data mining challenges include: How should those features be selected and ranked according to their predictive power? Which predictive model performs best in predicting a target disease from those features? In this paper, we propose top-k stability selection, which generalizes stability selection, a powerful sparse learning method for feature selection, by overcoming its limitation on parameter selection. In particular, top-k stability selection includes the original stability selection method as the special case k = 1. Moreover, we show that top-k stability selection is more robust than the original stability selection because it uses more information from the selection probabilities, and that it has stronger theoretical properties. On a large set of real clinical prediction datasets, top-k stability selection outperforms many existing feature selection methods, including the original stability selection. We also compare three competitive classification methods (SVM, logistic regression, and random forest) to demonstrate the effectiveness of the features selected by our method in clinical prediction applications. Finally, through several clinical applications on predicting heart failure related symptoms, we show that top-k stability selection can successfully identify important features that are clinically meaningful.
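
To make the relationship between the two scoring rules concrete, the following is a minimal sketch in Python. It rests on an assumption that the abstract does not spell out: that the top-k score of a feature is the mean of its k largest selection probabilities across the regularization path, so that k = 1 reduces to the maximum used by the original stability selection. The function name `top_k_stability_scores` and the toy probability matrix are hypothetical and serve only as illustration.

```python
import numpy as np


def top_k_stability_scores(selection_probs, k=1):
    """Score features from a matrix of selection probabilities.

    selection_probs has shape (n_lambdas, n_features); entry [j, f] is the
    fraction of subsamples in which feature f was selected at the j-th
    regularization value.

    ASSUMPTION: the top-k score is the mean of the k largest probabilities
    per feature. With k = 1 this equals the per-feature maximum.
    """
    probs = np.asarray(selection_probs, dtype=float)
    if probs.ndim != 2:
        raise ValueError("selection_probs must be 2-D (n_lambdas, n_features)")
    n_lambdas = probs.shape[0]
    if not 1 <= k <= n_lambdas:
        raise ValueError("k must be between 1 and the number of lambdas")
    # Sort each feature's probabilities in ascending order and keep the last k.
    top_k = np.sort(probs, axis=0)[-k:, :]
    return top_k.mean(axis=0)


# Toy selection probabilities: 3 regularization values, 3 features.
# Feature 0 is selected often at a single lambda only; feature 1 is
# selected fairly often across the whole path.
probs = np.array([
    [0.9, 0.5, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.0],
])

scores_k1 = top_k_stability_scores(probs, k=1)  # max over lambdas
scores_k2 = top_k_stability_scores(probs, k=2)  # mean of the two largest
ranking_k1 = np.argsort(-scores_k1)
ranking_k2 = np.argsort(-scores_k2)
print("k=1 scores:", scores_k1, "ranking:", ranking_k1)
print("k=2 scores:", scores_k2, "ranking:", ranking_k2)
```

In this toy example, feature 0 ranks first under k = 1 because of a single high probability, whereas under k = 2 feature 1 ranks first because it is selected consistently across regularization values. Under the stated assumption, this is one way in which using more of the selection probabilities could make the ranking less sensitive to a single regularization value.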
