Decision making using incomplete data

Decision-making often relies on relevant information extracted from data. To obtain such information, many data analysis techniques can be applied, including statistical analysis, clustering algorithms, and modeling techniques based on neural networks or machine learning. In practice, however, missing data is common, and most analysis techniques are not applicable to incomplete data. This paper investigates a heuristic approach to handling missing data in a machine learning system, SORCER. We applied SORCER to decide whether certain characteristics of COLIA1 gene mutations are associated with the fatal type of osteogenesis imperfecta (OI), a genetic disease. We compare the accuracy of SORCER's decisions with that of a high-performing machine learning system, See5, at different percentages of missing data. The results show that, as the degree of incompleteness increases, the average accuracies obtained from See5 tend to decline at a greater rate than those obtained from SORCER.
