Bayesian clustering and feature selection for cancer tissue samples

Background

The versatility of DNA copy number amplifications for profiling and categorizing tissue samples has been widely acknowledged in the biomedical literature. For instance, this type of measurement technique makes it possible to explore sets of cancerous tissues in order to identify novel subtypes. Statistical approaches previously used for such analyses include traditional algorithmic techniques for clustering and dimension reduction, such as independent and principal component analyses and hierarchical clustering, as well as model-based clustering using maximum likelihood estimation for latent class models.

Results

While purely algorithmic methods are usually easy to apply, their suboptimal performance and their limitations in supporting formal inference have been thoroughly discussed in the statistical literature. Here we introduce a Bayesian model-based approach to the simultaneous identification of underlying tissue groups and of the informative amplifications. The model-based approach makes it possible to use formal inference to determine the number of groups from the data, in contrast to the ad hoc methods often used for similar purposes. The model also automatically recognizes the chromosomal areas that are relevant for the clustering.

Conclusion

Validatory analyses of simulated data and of a large database of DNA copy number amplifications in human neoplasms are used to illustrate the potential of our approach. Our software implementation BASTA for performing Bayesian statistical tissue profiling is freely available for academic purposes at http://web.abo.fi/fak/mnf/mate/jc/software/basta.html
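The idea of scoring a clustering and a set of informative amplifications jointly can be illustrated with a small sketch. It is not the BASTA implementation: it assumes, purely for illustration, that amplifications are coded as binary features and that each feature follows a Bernoulli distribution with a Beta(0.5, 0.5) prior. The function names (`log_marginal`, `partition_score`) and the toy data are hypothetical. A feature marked as relevant gets a separate distribution in each group, while an irrelevant feature shares one distribution across all samples, so the log marginal likelihood can be compared across different numbers of groups and different feature subsets.

```python
from math import lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(values, a=0.5, b=0.5):
    """Log marginal likelihood of a 0/1 sequence under a Bernoulli model
    with a Beta(a, b) prior on the success probability (assumed prior)."""
    ones = sum(values)
    zeros = len(values) - ones
    return log_beta(a + ones, b + zeros) - log_beta(a, b)


def partition_score(data, labels, relevant):
    """Log marginal likelihood of binary data given a partition of the
    samples (labels) and a set of relevant feature indices.

    Relevant features are modelled separately within each group;
    irrelevant features are modelled once over all samples."""
    groups = sorted(set(labels))
    score = 0.0
    for j in range(len(data[0])):
        column = [row[j] for row in data]
        if j in relevant:
            for g in groups:
                members = [x for x, lab in zip(column, labels) if lab == g]
                score += log_marginal(members)
        else:
            score += log_marginal(column)
    return score


# Hypothetical toy data: 6 samples, 3 binary amplification indicators.
# Feature 0 separates the first three samples from the last three;
# features 1 and 2 are unrelated to that grouping.
data = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
one_group = [0, 0, 0, 0, 0, 0]
two_groups = [0, 0, 0, 1, 1, 1]

score_one = partition_score(data, one_group, {0})
score_two = partition_score(data, two_groups, {0})
score_all_relevant = partition_score(data, two_groups, {0, 1, 2})

print("one group, feature 0 relevant:   ", score_one)
print("two groups, feature 0 relevant:  ", score_two)
print("two groups, all features relevant:", score_all_relevant)
```

On this toy data the two-group partition with only feature 0 marked as relevant scores higher than both the single-group alternative and the two-group partition that treats every feature as relevant. A full method would also need a prior over partitions and feature subsets and a search or sampling strategy over them, neither of which is sketched here.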
