Investigating Topic Models' Capabilities in Expression Microarray Data Classification

In recent years, a particular class of probabilistic graphical models, called topic models, has proven to be a useful and interpretable tool for understanding and mining microarray data. In this context, such models have been applied almost exclusively to clustering, whereas the classification task has been disregarded by researchers. In this paper, we thoroughly investigate the use of topic models for the classification of microarray data, starting from ideas proposed in other fields (e.g., computer vision). We propose a classification scheme based on highly interpretable features extracted from topic models, resulting in a hybrid generative-discriminative approach. An extensive experimental evaluation on 10 benchmark datasets from the literature confirms the suitability of topic models for classifying expression microarray data.
