Feature selection methods and genomic big data: a systematic review

As genomic data grows at an accelerating pace, feature-selection techniques are expected to be a game changer: they can substantially reduce the complexity of the data, making it easier to analyze and to translate into useful information. Within the next decade, researchers are expected to move towards analyzing the genomes of all living creatures, making genomics the main generator of data. Without a thorough investigation of the field, it is almost impossible for researchers to see how their work relates to existing studies and how it contributes to the research community. In this paper, we present a systematic and structured literature review of the feature-selection techniques used in studies related to big genomic data analytics.
