A Combination of Multiple Imputation and Principal Component Analysis to Handle Missing Values with an Arbitrary Pattern

Hepatitis is a major health problem that can progress to chronic hepatitis and cancer. Computer-based diagnosis is now commonly used in medical examinations. Such a diagnosis relies on a disease dataset as the reference for making decisions. However, the dataset was incomplete: many of its instances contained missing values, which can bias the results of the analysis. One method for handling missing values is Multiple Imputation. The hepatitis dataset has an arbitrary pattern of missing values, which can be handled by using either Markov Chain Monte Carlo (MCMC) or Fully Conditional Specification (FCS) as the Multiple Imputation algorithm. This research conducted an experiment to compare combinations of Multiple Imputation algorithms with Principal Component Analysis (PCA) as instance selection. Instance selection was applied to reduce the data by selecting the variables that contribute most to the dataset. The goal was to improve the accuracy of analysis on data with an arbitrary pattern of missing values. The results showed that FCS-PCA performed best, with the highest accuracy (98.80%) and the lowest error rate (0.0116).
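
The general idea of pairing an FCS-style imputation with PCA can be illustrated with a minimal sketch. It is not the procedure used in this research: the data are synthetic, the function names `fcs_impute` and `pca_scores` are hypothetical, and the imputation step is a simplified chained-regression loop that adds residual noise but does not draw the regression parameters from their posterior, as a full FCS implementation would. The sketch only shows how several completed datasets could be produced from data with an arbitrary missingness pattern and then reduced with PCA.

```python
import numpy as np


def fcs_impute(X, n_iter=10, seed=0):
    """Return one completed copy of X using a simplified FCS-style loop.

    Each column with missing entries is regressed on all other columns,
    and its missing entries are replaced by the prediction plus noise.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Initial fill: observed column means.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any() or miss_j.all():
                continue  # nothing to impute, or nothing to learn from
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where column j is observed.
            coef, *_ = np.linalg.lstsq(A[~miss_j], filled[~miss_j, j], rcond=None)
            resid = filled[~miss_j, j] - A[~miss_j] @ coef
            noise = rng.normal(0.0, resid.std(), miss_j.sum())
            filled[miss_j, j] = A[miss_j] @ coef + noise
    return filled


def pca_scores(X, n_components):
    """Project centered X onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return Xc @ Vt[:n_components].T, explained[:n_components]


# Synthetic data (an assumption for illustration, not the hepatitis dataset).
rng = np.random.default_rng(42)
n = 200
z = rng.normal(size=(n, 2))
X_full = np.column_stack([
    z[:, 0],
    z[:, 0] + 0.1 * rng.normal(size=n),
    z[:, 1],
    z[:, 1] - z[:, 0] + 0.1 * rng.normal(size=n),
])

# Arbitrary missingness pattern: any cell may be missing.
mask = rng.random(X_full.shape) < 0.15
X_miss = X_full.copy()
X_miss[mask] = np.nan

# Multiple imputation: m = 5 completed datasets from different seeds.
imputations = [fcs_impute(X_miss, seed=s) for s in range(5)]

# PCA on each completed dataset, keeping two components.
reduced = [pca_scores(Xi, 2) for Xi in imputations]
```

Each entry of `reduced` holds the component scores and the explained-variance ratios for one completed dataset. A downstream classifier could then be trained on each set of scores and the results pooled, although how the pooling and the selection step were carried out in this research is not specified here.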
