Semi-automated Quality Assurance for Domain-Expert-Driven Data Exploration - An Application to Principal Component Analysis

Processing and exploring large quantities of electronic data is often an interesting yet challenging task. A lack of statistical and mathematical skills, together with missing know-how in handling large volumes of (health) data, poses high barriers to in-depth data exploration, especially when it is performed by domain experts. This paper presents guided visual pattern discovery, using the well-established data mining method Principal Component Analysis (PCA) as an example. Without guidance, the user must remain aware of the reliability of the computed results at every point of the analysis (the "garbage in, garbage out" principle, GIGO). In integrating PCA into an ontology-guided research infrastructure, we include a guidance system that supports the user through the individual analysis steps, and we introduce a quality measure, which is essential for sound research results.
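
To make the idea of a quality-checked analysis step concrete, the following minimal Python sketch runs a PCA only after two checks on the input pass. It is an illustration under stated assumptions, not the system described in the paper: the abstract does not specify the quality measure, so the Kaiser-Meyer-Olkin (KMO) sampling adequacy is used here as a stand-in, and the function names `kmo` and `guarded_pca` as well as the `min_kmo` cut-off of 0.5 are hypothetical choices made for this example.

```python
import numpy as np


def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure for a (rows x variables) array.

    Stand-in quality measure chosen for illustration; the abstract does
    not name the measure used in the paper.
    """
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations follow from the inverse correlation matrix.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + p2)


def guarded_pca(data, min_kmo=0.5):
    """Run a correlation-matrix PCA only if the input passes basic checks.

    min_kmo is an assumed threshold for this sketch, not a value taken
    from the paper.
    """
    data = np.asarray(data, dtype=float)
    if np.isnan(data).any():
        raise ValueError("missing values must be handled before PCA")
    adequacy = kmo(data)
    if adequacy < min_kmo:
        raise ValueError(f"KMO {adequacy:.2f} is below {min_kmo}; PCA not advised")

    # Standardize, then decompose the correlation matrix.
    standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(data, rowvar=False))
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = eigvals / eigvals.sum()
    scores = standardized @ eigvecs
    return adequacy, explained, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    factor = rng.normal(size=200)
    # Four synthetic variables sharing one common factor plus noise.
    sample = factor[:, None] + 0.5 * rng.normal(size=(200, 4))
    adequacy, explained, scores = guarded_pca(sample)
    print(f"KMO: {adequacy:.2f}")
    print("explained variance ratio:", np.round(explained, 3))
```

The design point is that the check runs before the decomposition, so an unreliable input stops the analysis with an explicit message instead of silently producing components that a domain expert might over-interpret.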
