A Review of Microarray Datasets: Where to Find Them and Specific Characteristics

The advent of DNA microarray datasets has stimulated a new line of research in both bioinformatics and machine learning. This type of data is used to collect information from tissue and cell samples about gene expression differences that can be useful for diagnosing disease or for distinguishing specific types of tumor. Classifying microarray data is a difficult challenge for machine learning researchers because of the high number of features and the small sample sizes. This chapter reviews the microarray databases most frequently used in the literature. We also make the interested reader aware of problematic data characteristics in this domain, such as data imbalance, data complexity, and the so-called dataset shift.
