Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes

Background: Commonly employed clustering methods for the analysis of gene expression data do not directly incorporate phenotypic data about the samples. Furthermore, clustering of samples with known phenotypes is typically performed in an informal fashion. The inability of clustering algorithms to incorporate biological data in the grouping process can limit proper interpretation of the data and its underlying biology.

Results: We present a more formal approach, the modk-prototypes algorithm, for clustering biological samples by simultaneously considering microarray gene expression data and classes of known phenotypic variables such as clinical chemistry evaluations and histopathologic observations. To measure the dissimilarity of the samples, the strategy constructs an objective function from the sum of the squared Euclidean distances for the numeric microarray and clinical chemistry data and simple matching for the categorical histopathology values. Separate weighting terms are used for the microarray, clinical chemistry and histopathology measurements to control the influence of each data domain on the clustering of the samples. The dynamic validity index for numeric data was modified with a category utility measure to determine the number of clusters in the data sets. A cluster's prototype, formed from the mean of the numeric features and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members. The approach is shown to work well with a simulated mixed data set and two real data examples containing numeric and categorical data types: one from a heart disease study and another from a study of acetaminophen (an analgesic) exposure in rat liver, which causes centrilobular necrosis.

Conclusion: The modk-prototypes algorithm partitioned the simulated data into clusters with samples in their respective class groups, and the heart disease samples into two groups (sick and buff, denoting samples having pain type representative of angina and non-angina, respectively) with an accuracy of 79%. This is on par with, or better than, the assignment accuracy of the heart disease samples by several well-known and successful clustering algorithms. Following modk-prototypes clustering of the acetaminophen-exposed samples, informative genes were identified from the cluster prototypes that are descriptive of, and phenotypically anchored to, levels of necrosis of the centrilobular region of the rat liver. The biological processes cell growth and/or maintenance, amine metabolism, and stress response were shown to discern between no and moderate levels of acetaminophen-induced centrilobular necrosis. The use of well-known and traditional measurements directly in the clustering provides some guarantee that the resulting clusters will be meaningfully interpretable.
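
The dissimilarity measure and the cluster prototype described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary layout of a sample (`expr`, `chem`, `path`), the function names, and the use of the weights as plain multipliers are all assumptions made for illustration. The modified dynamic validity index and the category utility measure used to choose the number of clusters are not reproduced here.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample layout (an assumption, not from the paper):
#   "expr": numeric microarray expression values
#   "chem": numeric clinical chemistry values
#   "path": categorical histopathology labels


def mixed_dissimilarity(sample, prototype, w_expr=1.0, w_chem=1.0, w_path=1.0):
    """Weighted dissimilarity between one sample and one cluster prototype."""
    # Squared Euclidean distance for each numeric domain.
    d_expr = sum((a - b) ** 2 for a, b in zip(sample["expr"], prototype["expr"]))
    d_chem = sum((a - b) ** 2 for a, b in zip(sample["chem"], prototype["chem"]))
    # Simple matching for the categorical domain: count the mismatches.
    d_path = sum(a != b for a, b in zip(sample["path"], prototype["path"]))
    # One weight per data domain controls its influence on the clustering.
    return w_expr * d_expr + w_chem * d_chem + w_path * d_path


def cluster_prototype(members):
    """Mean of each numeric feature, mode of each categorical feature."""
    def means(key):
        return [mean(col) for col in zip(*(m[key] for m in members))]

    def modes(key):
        # Ties are resolved in favour of the label seen first.
        return [Counter(col).most_common(1)[0][0]
                for col in zip(*(m[key] for m in members))]

    return {"expr": means("expr"), "chem": means("chem"), "path": modes("path")}


def assign(samples, prototypes, **weights):
    """Index of the nearest prototype for each sample."""
    return [
        min(range(len(prototypes)),
            key=lambda k: mixed_dissimilarity(s, prototypes[k], **weights))
        for s in samples
    ]
```

In a k-means-style loop, `assign` and `cluster_prototype` would alternate until the assignments stop changing. Raising `w_path` relative to the numeric weights makes a histopathology mismatch count for more, which is the role the abstract gives to the separate weighting terms.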
