Analysis of complexity indices for classification problems: Cancer gene expression data

Cancer diagnosis at the molecular level has become possible through the analysis of gene expression data. Typically, machine learning (ML) techniques are used to build automatic diagnosis models (classifiers) from cancer gene expression data. Such data often have characteristics that can impair the generalization ability of the resulting classifiers, among them data sparsity and an unbalanced class distribution. We investigate a set of indices designed to extract intrinsic complexity information from the data. These measures can be used to analyze, among other things, which characteristics of cancer gene expression data most affect the predictive ability of support vector machine classifiers. In this context, we also show that applying a proper feature selection procedure to the data can reduce the influence of those characteristics on the error rates of the induced classifiers.
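The abstract does not name the indices it studies. As an illustration only, one widely used complexity index for two-class problems is the maximum Fisher's discriminant ratio: for each feature, the squared difference of the class means divided by the sum of the class variances, with the largest value over all features taken as the index. The sketch below assumes that measure; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Fisher's discriminant ratio for one feature and two classes."""
    gap = (mean(values_a) - mean(values_b)) ** 2
    spread = pvariance(values_a) + pvariance(values_b)
    if spread == 0:
        # Both classes are constant on this feature: perfectly separable
        # if the means differ, uninformative otherwise.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def max_fisher_ratio(samples, labels):
    """Largest per-feature Fisher ratio (this sketch handles two classes only)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    ratios = []
    for j in range(len(samples[0])):
        a = [row[j] for row, y in zip(samples, labels) if y == classes[0]]
        b = [row[j] for row, y in zip(samples, labels) if y == classes[1]]
        ratios.append(fisher_ratio(a, b))
    return max(ratios)


# Hypothetical toy data: 4 samples, 2 "genes"; only the first gene separates the classes.
toy_samples = [[1.0, 5.0], [2.0, 5.5], [8.0, 5.2], [9.0, 4.9]]
toy_labels = [0, 0, 1, 1]
toy_index = max_fisher_ratio(toy_samples, toy_labels)
```

A high value suggests that at least one feature separates the classes well on its own, while a low value suggests a harder problem along individual features. Under this assumption, feature selection that keeps discriminative genes would be expected to keep the index high while reducing dimensionality.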
