Identifying Epigenetic Biomarkers using Maximal Relevance and Minimal Redundancy Based Feature Selection for Multi-Omics Data

Epigenetic biomarker discovery is an important task in bioinformatics. In this article, we develop a new framework for identifying statistically significant epigenetic biomarkers using maximal-relevance and minimal-redundancy criterion based feature (gene) selection on a multi-omics dataset. First, we determine the genes that have both expression and methylation values and follow a normal distribution. Similarly, we identify the genes that have both expression and methylation values but do not follow a normal distribution. For each case, we apply a gene-selection method that returns maximally relevant but variable-weighted minimally redundant genes as the top-ranked genes. For statistical validation, we apply the t-test to both the expression and the methylation data restricted to the normally distributed top-ranked genes to determine how many of them are both differentially expressed and differentially methylated. Similarly, we use the Limma package to perform a non-parametric Empirical Bayes test on both the expression and the methylation data restricted to the non-normally distributed top-ranked genes, again to identify how many of them are both differentially expressed and differentially methylated. Finally, we report the top-ranking significant gene markers with biological validation. Moreover, our framework improves the positive predictive rate and reduces the false positive rate in marker identification. In addition, we provide a comparative analysis of our gene-selection method against other methods, based on the classification performance obtained with several well-known classifiers.
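The gene-ranking step can be illustrated with a minimal sketch. It assumes the plain greedy maximal-relevance minimal-redundancy criterion in its difference form (relevance minus mean redundancy, both measured by mutual information on discretized values), not the variable-weighted, normalized-mutual-information variant used in the framework. The function names `mutual_information` and `mrmr_rank`, the gene names, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels, k):
    """Greedy mRMR ranking (difference form), an illustrative assumption.

    features: dict mapping a gene name to its list of discretized values.
    labels:   list of class labels, one per sample.
    Returns up to k gene names in the order they were selected.
    """
    relevance = {g: mutual_information(v, labels) for g, v in features.items()}
    selected = []
    candidates = sorted(features)  # sorted so that ties break deterministically

    def score(gene):
        if not selected:
            return relevance[gene]
        redundancy = sum(
            mutual_information(features[gene], features[s]) for s in selected
        ) / len(selected)
        return relevance[gene] - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (invented): 8 samples, two classes, three discretized "genes".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "geneA": [0, 0, 0, 1, 1, 1, 1, 1],  # strongly related to the class
    "geneB": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of geneA
    "geneC": [1, 0, 0, 0, 0, 1, 1, 1],  # weaker, but not redundant with geneA
}
ranked = mrmr_rank(features, labels, k=2)
print(ranked)
```

On this toy data the duplicate geneB is passed over in favour of geneC once geneA has been selected, which is the behaviour the redundancy term is meant to produce. Real expression and methylation values are continuous and would need to be discretized (or a continuous mutual-information estimator used) before a criterion like this applies.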
