Effective selection of informative SNPs and classification on the HapMap genotype data

BackgroundSince the single nucleotide polymorphisms (SNPs) are genetic variations which determine the difference between any two unrelated individuals, the SNPs can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible are required from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria in West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT), from which 100,736 SNPs were obtained and the top 82 SNPs could completely classify the three populations.ResultsIn this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or F-statistics. Then from the ranking list, we form different feature subsets by sequentially choosing different numbers of features (e.g., 1, 2, 3, ..., 100.) with top ranking values, train and test them by a classifier, e.g., the support vector machine (SVM), thereby finding one subset which has the highest classification accuracy. Compared to the classification method of Park et al., we obtain a better result, i.e., good classification of the 3 populations using on average 64 SNPs.ConclusionExperimental results show that the both of the modified t-test and F-statistics method are very effective in ranking SNPs about their classification capabilities. Combined with the SVM classifier, a desirable feature subset (with the minimum size and most informativeness) can be quickly found in the greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals.

[1]  Lipo Wang,et al.  Data Mining With Computational Intelligence , 2006, IEEE Transactions on Neural Networks.

[2]  Lipo Wang Support vector machines : theory and applications , 2005 .

[3]  Noah A. Rosenberg Algorithms for Selecting Informative Marker Panels for Population Assignment , 2005, J. Comput. Biol..

[4]  S. Wright THE INTERPRETATION OF POPULATION STRUCTURE BY F‐STATISTICS WITH SPECIAL REGARD TO SYSTEMS OF MATING , 1965 .

[5]  I. Măndoiu,et al.  Highly Scalable Genotype Phasing by Entropy Minimization , 2008, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[6]  Toshihiro Tanaka The International HapMap Project , 2003, Nature.

[7]  Ilya Levner,et al.  Feature selection and nearest centroid classification for protein mass spectrometry , 2005, BMC Bioinformatics.

[8]  P. Donnelly,et al.  A new statistical method for haplotype reconstruction from population data. , 2001, American journal of human genetics.

[9]  J. Devore,et al.  Statistics: The Exploration and Analysis of Data , 1986 .

[10]  Wei Xie,et al.  Accurate Cancer Classification Using Expressions of Very Few Genes , 2007, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[11]  David Ward,et al.  Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data , 2003, Bioinform..

[12]  Doheon Lee,et al.  SNP@Ethnos: a database of ethnically variant single-nucleotide polymorphisms , 2006, Nucleic Acids Res..

[13]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[14]  R. Ward,et al.  Informativeness of genetic markers for inference of ancestry. , 2003, American journal of human genetics.

[15]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[16]  Vladimir Pavlovic,et al.  RankGene: identification of diagnostic genes based on expression data , 2003, Bioinform..

[17]  Walter L. Ruzzo,et al.  Improved Gene Selection for Classification of Microarrays , 2002, Pacific Symposium on Biocomputing.

[18]  Zhen Lin,et al.  Choosing SNPs using feature selection , 2005, 2005 IEEE Computational Systems Bioinformatics Conference (CSB'05).

[19]  William M. K. Trochim Research Methods: The Concise Knowledge Base , 2004 .

[20]  Russell Schwartz,et al.  Haplotypes and informative SNP selection algorithms: don't block out information , 2003, RECOMB '03.

[21]  Eran Halperin,et al.  Tag SNP selection in genotype data for maximizing SNP prediction accuracy , 2005, ISMB.

[22]  Paul D. Minton,et al.  Statistics: The Exploration and Analysis of Data , 2002, Technometrics.

[23]  R. Tibshirani,et al.  Diagnosis of multiple cancer types by shrunken centroids of gene expression , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[24]  Russell Schwartz,et al.  Optimal Haplotype Block-free Selection of Tagging Snps for Genome-wide Association Studies , 2022 .

[25]  William M. K. Trochim,et al.  Research methods knowledge base , 2001 .

[26]  P. Tam The International HapMap Consortium. The International HapMap Project (Co-PI of Hong Kong Centre which responsible for 2.5% of genome) , 2003 .