Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping

Background
Many supervised learning algorithms have been applied to derive gene signatures for patient stratification from gene expression data. However, transferring multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods (Equal-width binning, Equal-frequency binning, and k-means clustering) for accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on isoform-level gene expression profiles from the exon-array platform and tested on the corresponding profiles from RNA-seq data.

Results
We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, adding the data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, adding the data discretization step dramatically improved classification accuracy. Equal-frequency binning showed the largest improvement, with accuracy above 90% for all models whose features were chosen by Random Forest based feature selection. Overall, the SVM classifier coupled with Equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, the best accuracy achieved was only 73.6%.

Conclusions
The classification algorithms, when trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, on cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with the highest classification accuracies on data transformed by Equal-frequency binning.
The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms.
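
The three discretization methods compared above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bin count of 3, and the toy expression values are all assumptions made for illustration, and the k-means variant is a plain one-dimensional Lloyd's algorithm with evenly spaced starting centroids.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Give each bin (nearly) the same number of samples, based on rank.

    Tied values are ordered by position, so ties may straddle a bin edge.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels


def kmeans_bins(values: Sequence[float], n_bins: int, n_iter: int = 50) -> List[int]:
    """1-D k-means (Lloyd's algorithm); bin k has the k-th smallest centroid."""
    lo, hi = min(values), max(values)
    # Assumed initialization: centroids evenly spaced over the value range.
    centroids = [lo + (hi - lo) * (k + 0.5) / n_bins for k in range(n_bins)]

    def assign(cents: List[float]) -> List[int]:
        return [min(range(n_bins), key=lambda k: abs(v - cents[k])) for v in values]

    for _ in range(n_iter):
        labels = assign(centroids)
        updated = []
        for k in range(n_bins):
            members = [v for v, lab in zip(values, labels) if lab == k]
            # An empty cluster keeps its previous centroid.
            updated.append(sum(members) / len(members) if members else centroids[k])
        if updated == centroids:
            break
        centroids = updated
    return assign(centroids)


# Hypothetical values for one isoform across six samples, on two scales.
array_like = [2.0, 3.5, 5.0, 7.5, 9.0, 11.0]   # log-scale intensities (made up)
count_like = [2.0 ** x for x in array_like]    # same ordering, different scale

print(equal_frequency_bins(array_like, 3), equal_frequency_bins(count_like, 3))
print(equal_width_bins(array_like, 3), equal_width_bins(count_like, 3))
print(kmeans_bins([1.0, 1.2, 0.8, 5.0, 5.2, 9.0, 9.1], 3))
```

In this toy case the equal-frequency labels are identical on both scales, because they depend only on the rank order of the samples, while the equal-width labels change when the scale changes. Whether this property accounts for the cross-platform results reported above is not stated in the abstract.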
