A Novel Approach for Classifying Gene Expression Data using Topic Modeling

Understanding the role of differential gene expression in cancer etiology and cellular processes is a complex problem that remains challenging because of the sheer number of genes and inter-related biological processes involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to mitigate overfitting on high-dimensional gene expression data and to facilitate understanding of the associated pathways. LDA has recently been applied to clustering and exploring genomic data, but not to classification and prediction. Here, we propose to use LDA for both clustering and classification of cancer and healthy tissues, using messenger RNA (mRNA) sequencing data from lung cancer and breast cancer. We describe our study in three phases: clustering, classification, and gene interpretation. First, we use LDA as a clustering algorithm to group the data in an unsupervised manner. Next, we develop a novel LDA-based classification approach that classifies unknown samples based on the similarity of their co-expression patterns. In our evaluation, this approach achieves high accuracy relative to alternative approaches. Lastly, we present a functional analysis of the genes identified using a novel topic profile matrix formulation. This analysis identified several genes and pathways that could potentially be involved in differentiating tumor samples from normal samples. Overall, our results suggest that LDA is a promising approach for classifying tissue types based on gene expression data in cancer studies.
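The clustering and classification phases can be illustrated with a minimal sketch. It is not the implementation used in the paper. It assumes that each sample is treated as a "document" whose "words" are genes repeated according to their expression counts, fits LDA with a small collapsed Gibbs sampler, clusters samples by their dominant topic, and assigns an unknown sample to the class whose mean topic proportions are closest. The toy count matrix, the function names (`counts_to_doc`, `lda_gibbs`, `classify`), the hyperparameters, and the nearest-centroid rule are all illustrative assumptions.

```python
import random

def counts_to_doc(counts):
    """Turn one sample's gene counts into a list of gene indices (tokens)."""
    return [g for g, c in enumerate(counts) for _ in range(c)]

def lda_gibbs(docs, n_topics, n_genes, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampler for LDA; returns per-sample topic proportions."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]            # sample-topic counts
    nkw = [[0] * n_genes for _ in range(n_topics)]  # topic-gene counts
    nk = [0] * n_topics                             # tokens assigned to each topic
    z = []                                          # topic assignment of every token
    for d, doc in enumerate(docs):
        z.append([])
        for g in doc:
            k = rng.randrange(n_topics)             # random initial assignment
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][g] += 1
            nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, g in enumerate(doc):
                k = z[d][i]                         # remove the current assignment
                ndk[d][k] -= 1
                nkw[k][g] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][g] + beta) / (nk[t] + n_genes * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                         # record the resampled topic
                ndk[d][k] += 1
                nkw[k][g] += 1
                nk[k] += 1
    return [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]

def classify(theta, labels):
    """Label each unknown sample (None) with the nearest class centroid in topic space."""
    centroids = {}
    for lab in {l for l in labels if l is not None}:
        rows = [theta[i] for i, l in enumerate(labels) if l == lab]
        centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]

    def nearest(vec):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )

    return [l if l is not None else nearest(theta[i]) for i, l in enumerate(labels)]

# Hypothetical toy data: rows are samples, columns are read counts for six genes.
counts = [
    [9, 8, 7, 0, 1, 0],  # labeled tumor
    [8, 9, 6, 1, 0, 0],  # labeled tumor
    [0, 1, 0, 9, 8, 7],  # labeled normal
    [1, 0, 0, 7, 9, 8],  # labeled normal
    [7, 9, 8, 0, 0, 1],  # unknown sample
]
labels = ["tumor", "tumor", "normal", "normal", None]

docs = [counts_to_doc(c) for c in counts]
theta = lda_gibbs(docs, n_topics=2, n_genes=6)
clusters = [max(range(2), key=row.__getitem__) for row in theta]  # dominant topic
predicted = classify(theta, labels)
print(clusters, predicted[-1])
```

In this sketch the unknown sample is included when fitting the topics, which is acceptable because LDA itself never sees the labels; a held-out evaluation would instead infer topic proportions for new samples from a model fitted on the training samples only.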
