A SNP Microarray Analysis Pipeline Using Machine Learning Techniques

EVANS, DANIEL T., M.S., November 2010, Computer Science
A SNP Microarray Analysis Pipeline Using Machine Learning Techniques (118 pp.)
Director of Thesis: Lonnie R. Welch

A software pipeline has been developed to aid in SNP microarray analysis in case/control genome-wide association (GWA) studies. The pipeline takes data from previous GWA studies hosted on the NCBI Gene Expression Omnibus website and analyzes the SNP information from these studies to create predictive classifiers. These classifiers attempt to predict whether individuals have a particular phenotype based on their genotypes. Two different methods were used to create these predictive models. One uses a popular machine learning technique, support vector machines, and the other is a simpler method based on differences in genotype totals between cases and controls. One major benefit of the support vector machine method is its ability to integrate and consider many combinations of SNPs in a computationally inexpensive manner. Two datasets were used to test the software pipeline: GSE13117, which consists of children with mental retardation and their parents, and GSE9222, which consists of autistic patients and their parents. Classifier performance was reported using a Bayesian confidence interval in addition to 5-repeated 10-fold cross-validation (5r-10cv). For the GSE9222 dataset, the top performing model achieved a balanced accuracy of 70.8% and a normal accuracy of 71.7% using 5r-10cv. The model whose distribution had the highest upper bound had a 95% confidence interval for balanced accuracy of 62.1% to 75.3%. For the GSE13117 dataset, the top performing classifier achieved a balanced accuracy of 56.2% and a normal accuracy of 65.7% using 5r-10cv. The model whose distribution had the highest upper bound for the GSE13117 dataset had a 95% confidence interval for balanced accuracy of 49.6% to 68.3%.
Such classifiers may eventually lead to new insights into disease and allow for simpler and more accurate diagnoses.
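
The gap between normal and balanced accuracy reported above, and the idea of an interval on balanced accuracy, can be illustrated with a short Python sketch. It is not the thesis implementation: the abstract does not say how the Bayesian interval was computed, so the sketch assumes flat Beta(1, 1) priors on sensitivity and specificity and approximates the posterior of their mean by Monte Carlo sampling. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
import random


def accuracies(tp, fn, tn, fp):
    """Return (normal accuracy, balanced accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of cases classified correctly
    specificity = tn / (tn + fp)   # fraction of controls classified correctly
    normal = (tp + tn) / (tp + fn + tn + fp)
    balanced = (sensitivity + specificity) / 2
    return normal, balanced


def balanced_accuracy_interval(tp, fn, tn, fp, level=0.95, draws=20000, seed=0):
    """Approximate a Bayesian interval for balanced accuracy.

    Assumption (not taken from the thesis): sensitivity and specificity get
    independent flat Beta(1, 1) priors, so their posteriors are
    Beta(tp + 1, fn + 1) and Beta(tn + 1, fp + 1). The posterior of their
    mean is approximated by sampling, and the central `level` mass is returned.
    """
    rng = random.Random(seed)
    samples = sorted(
        (rng.betavariate(tp + 1, fn + 1) + rng.betavariate(tn + 1, fp + 1)) / 2
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper


# Hypothetical, imbalanced test set: 30 cases and 70 controls.
normal, balanced = accuracies(tp=12, fn=18, tn=63, fp=7)
low, high = balanced_accuracy_interval(tp=12, fn=18, tn=63, fp=7)
print(f"normal accuracy:   {normal:.3f}")
print(f"balanced accuracy: {balanced:.3f}")
print(f"95% interval:      {low:.3f} to {high:.3f}")
```

With unequal numbers of cases and controls, a classifier that favors the larger class scores higher on normal accuracy than on balanced accuracy, which is one plausible reading of the difference between the two figures reported for GSE13117.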