Evidential reasoning for preprocessing uncertain categorical data for trustworthy decisions: An application on healthcare and finance

Abstract The uncertainty introduced by discrepant data in AI-enabled decisions is a critical challenge in highly regulated domains such as healthcare and finance. Ambiguity in output attributes and incompleteness due to missing values in input attributes are ubiquitous in these domains. Such data can adversely affect groups of people who are under-represented in the training data, even when the developer has no intention to discriminate. Categorical attributes are inherently non-numerical, and incomplete or ambiguous categorical attributes in a dataset further increase the uncertainty in decision-making. This paper addresses the challenges of handling categorical attributes, which previous research has not addressed comprehensively. Three sources of uncertainty in categorical attributes are recognised: informational uncertainty, unforeseeable uncertainty in the decision task environment, and uncertainty due to a lack of pre-modelling explainability. All three are addressed by the proposed methodology, which is based on maximum likelihood evidential reasoning (MAKER). The methodology transforms and imputes incomplete and ambiguous categorical attributes into interpretable numerical features. It uses notions of weight and reliability to capture, respectively, subjective expert preference over a piece of evidence and the quality of evidence in a categorical attribute. The MAKER framework integrates the recognised uncertainties into the transformed input data, which allows a model to perceive data limitations during training and to acknowledge doubtful predictions, supporting trustworthy pre-modelling and post-modelling explainability. The ability to handle uncertainty and its impact on explainability is demonstrated on real-world healthcare and finance data under different missing-data scenarios with three types of AI algorithms: deep-learning, tree-based, and rule-based models.
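To make the idea of turning a categorical attribute into interpretable numerical features more concrete, the following is a minimal sketch and not the paper's MAKER implementation. It assumes a simple likelihood-based encoding in which each category is replaced by its normalised likelihoods over the output classes, and a missing input is treated as its own category. The names `likelihood_encoding`, `transform`, and the `MISSING` placeholder are hypothetical, and weight, reliability, and ambiguous outputs are deliberately not modelled.

```python
from collections import Counter, defaultdict

MISSING = "<missing>"  # hypothetical placeholder for a missing input value


def likelihood_encoding(values, labels):
    """Map each category to normalised likelihoods over the output classes.

    Assumed scheme: for category v and class k,
        L(v | k) = count(v, k) / count(k)
        p(k | v) = L(v | k) / sum_j L(v | j)
    Missing inputs (None) are kept as a separate category.
    """
    values = [MISSING if v is None else v for v in values]
    class_counts = Counter(labels)
    joint_counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        joint_counts[v][y] += 1

    classes = sorted(class_counts)
    encoding = {}
    for v, counts in joint_counts.items():
        likelihoods = [counts[k] / class_counts[k] for k in classes]
        total = sum(likelihoods)  # > 0 because v occurs at least once
        encoding[v] = [lik / total for lik in likelihoods]
    return classes, encoding


def transform(values, encoding):
    """Replace each categorical value by its numerical feature vector."""
    return [encoding[MISSING if v is None else v] for v in values]


# Toy data, invented purely for illustration.
values = ["a", "a", "b", "b", None, "a"]
labels = [1, 1, 0, 0, 0, 0]
classes, encoding = likelihood_encoding(values, labels)
features = transform(values, encoding)
```

Each row of `features` sums to one and can be read directly as the support a category lends to each class, which is the sense in which such features remain interpretable. A category that never appeared in the data used to build the encoding would raise a `KeyError` in this sketch.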
