Algorithms for regression and classification: robust regression and genetic association studies

Regression and classification are statistical techniques for extracting rules and patterns from data sets. Analyzing the algorithms involved is interdisciplinary research that poses interesting problems for statisticians and computer scientists alike. This thesis focuses on robust regression and on classification in genetic association studies. For robust regression, it presents new exact algorithms and results for robust online scale estimation with the estimators Qn and Sn, and for robust linear regression in the plane with the least quartile difference (LQD) estimator. It also devises an evolutionary computation algorithm for robust regression in higher dimensions that supports several estimators, including the widely used least median of squares (LMS) and least trimmed squares (LTS). For classification in genetic association studies, the thesis describes a Genetic Programming algorithm that outperforms the standard approaches on the data sets considered. The algorithm identifies interesting genetic factors not previously found in a data set on sporadic breast cancer, and it handles larger data sets than the methods it is compared with. It can also be extended to further application fields.
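For orientation, the two scale estimators named above can be computed directly from their usual definitions. The following is a minimal sketch, not one of the exact or online algorithms of the thesis: it is a naive quadratic-time version, the function names `qn_scale` and `sn_scale` are hypothetical, and it assumes the common conventions of Gaussian consistency factors 2.2219 and 1.1926, a low outer median and a high inner median for Sn, and no finite-sample correction.

```python
from itertools import combinations
from math import comb


def qn_scale(xs):
    """Naive Qn: scaled k-th smallest pairwise distance, k = C(h, 2), h = n // 2 + 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    h = n // 2 + 1
    k = comb(h, 2)
    # All |x_i - x_j| for i < j, sorted ascending (O(n^2 log n) time).
    diffs = sorted(abs(a - b) for a, b in combinations(xs, 2))
    return 2.2219 * diffs[k - 1]  # assumed Gaussian consistency factor


def sn_scale(xs):
    """Naive Sn: scaled low median over i of the high median over j of |x_i - x_j|."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    inner = []
    for a in xs:
        # Distances from a to every observation, including itself (assumed convention).
        dists = sorted(abs(a - b) for b in xs)
        inner.append(dists[n // 2])  # high median: order statistic n // 2 + 1
    inner.sort()
    return 1.1926 * inner[(n + 1) // 2 - 1]  # low median: order statistic (n + 1) // 2


if __name__ == "__main__":
    clean = [1.0, 2.0, 3.0, 4.0, 5.0]
    spoiled = [1.0, 2.0, 3.0, 4.0, 1000.0]  # one gross outlier
    print(qn_scale(clean), qn_scale(spoiled))
    print(sn_scale(clean), sn_scale(spoiled))
```

Both functions compare every pair of observations, which is acceptable for small samples but too slow for moving-window (online) use, where a far cheaper update per new observation is needed.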
