Shrunken Dissimilarity Measure for Genome-wide SNP Data Classification

Recent development of high-resolution single-nucleotide polymorphism (SNP) arrays allows detailed assessment of genome-wide human genome variations. However, SNP data typically contain a large number of SNPs (e.g., 400 thousand SNPs in genome-wide Parkinson disease SNP data) but only a few hundred samples. Conventional classification methods may not be effective when applied to such genome-wide SNP data. In this paper, we propose a shrunken dissimilarity measure to analyze and select relevant SNPs for classification problems. Examples using HapMap data and Parkinson disease data demonstrate the effectiveness of the proposed method and illustrate its potential to become a useful analysis tool for SNP data sets. In particular, we find some SNPs on chromosome 2 that are contained in genes relevant to Parkinson disease.
