Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data

Background

Trait heterogeneity, which exists when a trait has been defined with insufficient specificity such that it is actually two or more distinct traits, has been implicated as a confounding factor in traditional statistical genetics of complex human disease. In the absence of detailed phenotypic data collected consistently in combination with genetic data, unsupervised computational methodologies offer the potential for discovering underlying trait heterogeneity. The performance of three such methods appropriate for categorical data (Bayesian Classification, Hypergraph-Based Clustering, and Fuzzy k-Modes Clustering) was compared. Also tested was the ability of these methods to detect trait heterogeneity in the presence of locus heterogeneity and/or gene-gene interaction, two other complicating factors in discovering genetic models of complex human disease. To determine the efficacy of applying the Bayesian Classification method to real data, the reliability of its internal clustering metrics in identifying good clusterings was evaluated using permutation testing.

Results

Bayesian Classification outperformed the other two methods, with the exception that Fuzzy k-Modes Clustering performed best on the most complex genetic model. Bayesian Classification achieved excellent recovery for 75% of the datasets simulated under the simplest genetic model, while it achieved moderate recovery for 56% of datasets with a sample size of 500 or more (across all simulated models) and for 86% of datasets with 10 or fewer nonfunctional loci (across all simulated models). Neither Hypergraph-Based Clustering nor Fuzzy k-Modes Clustering achieved good or excellent cluster recovery for a majority of datasets, even under a restricted set of conditions. When the average log of class strength was used as the internal clustering metric, the false positive rate was controlled very well, at 3% or less for all three significance levels (0.01, 0.05, 0.10), and the false negative rate was acceptably low (18%) at the least stringent significance level of 0.10.

Conclusion

Bayesian Classification shows promise as an unsupervised computational method for dissecting trait heterogeneity in genotypic data. Its control of false positive and false negative rates lends confidence to the validity of its results. Further investigation of how different parameter settings may improve the performance of Bayesian Classification, especially under more complex genetic models, is ongoing.
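The permutation testing mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a null distribution built by shuffling each locus independently across individuals, which breaks multilocus structure while preserving per-locus genotype frequencies. The names `permute_within_loci`, `permutation_p_value`, and `toy_metric` are hypothetical, and `toy_metric` is only a stand-in for a real internal clustering metric such as the average log of class strength.

```python
import random
from collections import Counter


def permute_within_loci(genotypes, rng):
    """Shuffle each locus (column) independently across individuals (rows)."""
    columns = [list(col) for col in zip(*genotypes)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def permutation_p_value(genotypes, metric_fn, n_perm=200, seed=0):
    """Estimate how often permuted data score at least as high as the real data."""
    rng = random.Random(seed)
    observed = metric_fn(genotypes)
    exceed = sum(
        metric_fn(permute_within_loci(genotypes, rng)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (exceed + 1) / (n_perm + 1)


def toy_metric(genotypes):
    """Stand-in metric (assumption): share of individuals carrying the most
    common multilocus genotype. A real analysis would cluster the data and
    return the clustering method's own internal score instead."""
    counts = Counter(tuple(row) for row in genotypes)
    return max(counts.values()) / len(genotypes)


# Hypothetical data: two hidden subgroups with distinct six-locus genotypes.
structured = [[0, 0, 0, 1, 1, 2]] * 20 + [[2, 1, 2, 0, 0, 0]] * 20
p_value = permutation_p_value(structured, toy_metric)
print(f"permutation p-value: {p_value:.4f}")
```

In this sketch, a clustering result would be declared significant when `p_value` falls below the chosen significance level (0.01, 0.05, or 0.10 in the abstract). False positive and false negative rates could then be estimated by repeating the test over datasets simulated without and with trait heterogeneity.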
