Data and text mining

Visualization-based cancer microarray data classification analysis

MOTIVATION: Methods for analyzing cancer microarray data often face two distinct challenges: the models they infer need to perform well when classifying new tissue samples, while at the same time providing insight into the patterns and gene interactions hidden in the data. State-of-the-art supervised data mining methods often address only one of these aspects well, which motivates the development of methods whose predictive models combine solid classification performance with ease of communication to the domain expert.

RESULTS: Data visualization can be an excellent approach to knowledge discovery and analysis of class-labeled data. We have previously developed an approach called VizRank that scores and ranks point-based visualizations according to the degree of separation of data instances of different classes. Here we extend VizRank with techniques to uncover outliers, score features (genes) and perform classification, and we demonstrate that the proposed approach is well suited for cancer microarray analysis. Using VizRank and radviz visualization on a set of previously published cancer microarray data sets, we found simple, interpretable data projections that include only a small subset of genes yet clearly differentiate among cancer types. We also report that our approach to classification through visualization achieves performance comparable to state-of-the-art supervised data mining techniques.

AVAILABILITY: VizRank and radviz are implemented as part of the Orange data mining suite (http://www.ailab.si/orange).

SUPPLEMENTARY INFORMATION: Supplementary data are available from http://www.ailab.si/supp/bi-cancer.
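The idea of scoring radviz projections by class separation, as described under RESULTS, can be illustrated with a minimal Python sketch. It is not the Orange implementation and does not use the Orange API. The abstract does not state how separation is measured, so the sketch assumes leave-one-out k-nearest-neighbour accuracy on the projected points as the score. The function names (`radviz_project`, `knn_loo_score`, `rank_projections`) and the toy data are hypothetical and serve only as illustration.

```python
import math
from itertools import combinations


def radviz_project(rows, features):
    """Place each row in 2D using radviz: the chosen features act as anchors
    spaced evenly on the unit circle, and each row is pulled toward an anchor
    in proportion to its min-max scaled value for that feature."""
    n = len(features)
    anchors = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
               for i in range(n)]
    lows = [min(r[f] for r in rows) for f in features]
    highs = [max(r[f] for r in rows) for f in features]
    points = []
    for r in rows:
        # Scale each feature to [0, 1]; a constant feature contributes no pull.
        weights = [(r[f] - lo) / (hi - lo) if hi > lo else 0.0
                   for f, lo, hi in zip(features, lows, highs)]
        total = sum(weights)
        if total == 0:
            points.append((0.0, 0.0))
            continue
        x = sum(w * a[0] for w, a in zip(weights, anchors)) / total
        y = sum(w * a[1] for w, a in zip(weights, anchors)) / total
        points.append((x, y))
    return points


def knn_loo_score(points, labels, k=3):
    """Assumed separation score: leave-one-out k-NN accuracy in the 2D plane."""
    correct = 0
    for i, p in enumerate(points):
        others = [(math.dist(p, q), labels[j])
                  for j, q in enumerate(points) if j != i]
        nearest = sorted(others, key=lambda t: t[0])[:k]
        votes = {}
        for _, label in nearest:
            votes[label] = votes.get(label, 0) + 1
        predicted = max(votes, key=votes.get)
        correct += predicted == labels[i]
    return correct / len(points)


def rank_projections(rows, labels, n_features, size=3, k=3):
    """Score every feature subset of the given size and sort best-first."""
    scored = [(knn_loo_score(radviz_project(rows, subset), labels, k), subset)
              for subset in combinations(range(n_features), size)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored


# Hypothetical toy data: six samples, four "genes", two classes.
toy_rows = [
    [9, 1, 5, 3], [8, 2, 5, 7], [9, 2, 5, 2],   # class "A"
    [1, 9, 5, 6], [2, 8, 5, 3], [1, 8, 5, 7],   # class "B"
]
toy_labels = ["A", "A", "A", "B", "B", "B"]

for score, subset in rank_projections(toy_rows, toy_labels, n_features=4):
    print(subset, round(score, 2))
```

Exhaustive enumeration is only feasible for a handful of features. With thousands of genes, a practical search would need to restrict or prioritize the candidate subsets, a detail this sketch leaves out.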
