Statistical analysis of genotype and gene expression data

A common and important goal in cancer research is the identification of genetic markers, such as genes or genetic variations, that make it possible to determine whether a person has a particular type of cancer, or that lead to a higher risk of developing cancer. In recent years, many biotechnologies for measuring these markers have been developed. The most prominent examples are microarrays, which can be used, e.g., to measure the expression levels of tens of thousands of genes simultaneously. The most widely used type of microarray is the Affymetrix GeneChip, on which each gene is represented by eleven pairs of probes. The corresponding probe intensities have to be preprocessed, i.e. summarized into one expression value per gene, before variable selection and classification methods can be applied to the gene expression data. This thesis is based on two projects. The goals of the first project are to identify the preprocessing method for Affymetrix microarrays that leads to the most efficient data reduction, and to provide software that makes it possible to apply this procedure to data from studies comprising hundreds of Affymetrix GeneChips. The results of this project are presented in this thesis. The second project is concerned with SNPs (Single Nucleotide Polymorphisms), i.e. variations at a single base-pair position in the genome. While a vast number of papers on the analysis of gene expression data have been published, only a few variable selection and classification methods that address the specific needs of SNP data analysis have been proposed. One of the exceptions is logic regression. This thesis shows how approaches for the analysis of gene expression data can be adapted to SNP data, and proposes a procedure based on a bagging version of logic regression that enables the detection of SNP interactions that explain a higher cancer risk. Furthermore, two measures for quantifying the importance of each of these interactions for prediction are presented and compared with existing measures.
