Target-Focused Feature Selection Using Uncertainty Measurements in Healthcare Data

Healthcare big data remains under-utilized owing to incompatibilities between the domains of data analytics and healthcare, among them the lack of generalizable methods for iterative feature acquisition under a budget and of machine learning models that support reasoning about their own uncertainty. Meanwhile, the available data is growing rapidly with the spread of Internet of Things applications and personalized healthcare. For the healthcare domain to adopt models that take advantage of this big data, machine learning models should be coupled with more informative, relevant feature acquisition methods, which in turn adds robustness to the models' results. We introduce an approach to feature selection based on Bayesian learning that reports the model's level of uncertainty alongside its false-positive and false-negative rates. In addition, measuring target-specific uncertainty lifts the restriction that feature selection be target-agnostic, allowing features to be acquired for a target of focus. We show that acquiring features for a specific target performs at least as well as deep-learning-based and common linear feature selection methods on small, non-sparse datasets, and surpasses them on real-world data that is larger in scale and sparser.
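To illustrate the core idea of target-focused acquisition driven by uncertainty, the following is a minimal sketch, not the paper's actual method: it uses a bootstrap ensemble of Gaussian naive Bayes models as a stand-in for a Bayesian posterior, and greedily scores each candidate feature by how much acquiring it reduces the model's average predictive entropy about the target class. All names, the synthetic data, and the ensemble choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary task: features 0 and 1 carry signal about the target
# class, features 2 and 3 are pure noise.
n = 400
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 4))
X[:, 0] += 1.5 * y  # informative, already "acquired"
X[:, 1] -= 1.0 * y  # informative candidate

def gnb_fit(X, y):
    """Per-class feature means/variances for a Gaussian naive Bayes model."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(0), Xc.var(0) + 1e-6, len(Xc) / len(X))
    return params

def gnb_proba(params, X):
    """Posterior probability of class 1 under the fitted model."""
    logp = []
    for c in (0, 1):
        mu, var, prior = params[c]
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(1)
        logp.append(ll + np.log(prior))
    logp = np.stack(logp, 1)
    logp -= logp.max(1, keepdims=True)
    p = np.exp(logp)
    return p[:, 1] / p.sum(1)

def mean_entropy(p):
    """Average binary predictive entropy, a simple uncertainty measure."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.mean(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def entropy_drop(base_cols, extra_col, n_boot=20):
    """Average reduction in predictive uncertainty from acquiring extra_col,
    estimated over a bootstrap ensemble (a crude posterior proxy)."""
    drops = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        Xb, yb = X[idx], y[idx]
        h0 = mean_entropy(gnb_proba(gnb_fit(Xb[:, base_cols], yb), X[:, base_cols]))
        cols = base_cols + [extra_col]
        h1 = mean_entropy(gnb_proba(gnb_fit(Xb[:, cols], yb), X[:, cols]))
        drops.append(h0 - h1)
    return float(np.mean(drops))

# Greedily pick the next feature to acquire: the one whose acquisition
# most reduces uncertainty about the target of focus.
scores = {j: entropy_drop([0], j) for j in (1, 2, 3)}
best = max(scores, key=scores.get)
print(best)  # the informative candidate should beat the noise features
```

Because the score is computed against a specific target's predictive distribution, the same machinery can rank features differently for different targets of focus, which is the restriction the abstract says target-agnostic selection cannot lift.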
