Integrated Statistical and Rule-Mining Techniques for DNA Methylation and Gene Expression Data Analysis

Abstract Statistical analysis and association rule mining are useful, complementary techniques for determining the relationships among significant gene markers. Statistical analysis identifies the gene markers that are significantly differentially expressed or methylated, whereas association rule mining uncovers interesting relationships among them across different types of samples or conditions. In this article, we apply statistical tests and association rule mining to gene expression and DNA methylation datasets to predict the class of a sample (viz., uterine leiomyoma/class-formersmoker versus uterine myometrium/class-neversmoker). We propose a novel rule-based classifier for this purpose. The Apriori association rule mining algorithm generates association rules from the training set, and a Genetic Algorithm based rank aggregation technique then ranks these rules according to sixteen different rule-interestingness measures. Once the rules are ranked, we apply majority voting to each test point, estimating its class label through a weighted-sum method. We run this classifier on the combined dataset using 4-fold cross-validation and compare its performance with that of other popular rule-based classifiers. Finally, we identify the status of some important gene markers through frequency analysis of the evolved rules for each of the two class labels individually, in order to formulate the interesting associations among them.
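The voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how a rule's rank is turned into a weight, so the linear weight `n_rules - rank + 1` is an assumption, and the `Rule` type, the `predict` function, the item names such as `GENE_A+`, and the class labels `tumor` and `normal` are all hypothetical names introduced only for illustration. The sketch assumes the rules have already been mined and ranked.

```python
from collections import defaultdict
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: frozenset  # items that must all be present in a sample
    label: str             # class label predicted by the rule
    rank: int              # aggregated rank, 1 = best


def predict(rules, sample_items, default=None):
    """Weighted-sum vote over the rules whose antecedent matches the sample.

    Assumption: a rule with rank r among n rules gets weight n - r + 1,
    so better-ranked rules contribute more to their class score.
    """
    n_rules = len(rules)
    scores = defaultdict(float)
    for rule in rules:
        # A rule fires when every antecedent item occurs in the sample.
        if rule.antecedent <= sample_items:
            scores[rule.label] += n_rules - rule.rank + 1
    if not scores:
        return default  # no rule matched this sample
    return max(scores, key=scores.get)


# Hypothetical ranked rules; "+" and "-" stand for up/down status of a marker.
rules = [
    Rule(frozenset({"GENE_A+", "GENE_B-"}), "tumor", 1),
    Rule(frozenset({"GENE_C+"}), "normal", 2),
    Rule(frozenset({"GENE_D-"}), "normal", 3),
]

sample = frozenset({"GENE_A+", "GENE_B-", "GENE_C+"})
prediction = predict(rules, sample)
```

Here the first two rules fire on `sample`. The top-ranked rule contributes a weight of 3 to `tumor` and the second contributes 2 to `normal`, so the sample is assigned to `tumor`. Any other monotone mapping from rank to weight could be substituted without changing the structure of the vote.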
