Feature selection based on functional group structure for microRNA expression data analysis

Feature selection methods have been widely used in gene expression analysis to identify differentially expressed genes and explore potential biomarkers for complex diseases. Many studies have shown that incorporating feature structure information can greatly enhance the performance of feature selection algorithms, and genes naturally fall into groups according to common function and co-regulation, yet only a few gene expression studies have exploited this structure. As far as we know, no such study exists for microRNA (miRNA) expression analysis, owing to the lack of available functional annotation for miRNAs. In this study, we focus on miRNA expression analysis because of its importance in diagnosis, prognosis prediction, and the detection of new therapeutic targets for complex diseases. MiRNAs tend to work in groups to carry out their regulatory roles, so miRNA expression data also have group structure. We use GO-based semantic similarity to infer miRNA functional groups and propose a new feature selection method that takes group structure into account, called MiRFFS (MiRNA Functional group-based Feature Selection). We also apply the group information to the sparse group Lasso method, and compare MiRFFS with the sparse group Lasso and several existing feature selection methods. Results on three miRNA microarray profiles of breast cancer show that MiRFFS can achieve a compact feature subset with high classification accuracy.
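To make the role of group information concrete, the following is a minimal sketch of the penalty term in the commonly used sparse group Lasso formulation, which combines an L1 term on individual coefficients with a group-wise L2 term weighted by the square root of the group size. It is not the MiRFFS procedure, whose details are not given here. The function name, the group labels, and the parameters `lam` and `alpha` are assumptions chosen for illustration.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5):
    """Return lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2).

    beta   : list of coefficients, one per feature (e.g. one per miRNA)
    groups : list of group labels, one per feature (hypothetical functional groups)
    lam    : overall regularization strength (assumed name)
    alpha  : mixing weight between the L1 term and the group term (assumed name)
    """
    # L1 term: encourages sparsity of individual features within groups.
    l1_term = sum(abs(b) for b in beta)

    # Collect the coefficients belonging to each group.
    by_group = {}
    for b, g in zip(beta, groups):
        by_group.setdefault(g, []).append(b)

    # Group term: encourages entire groups to be dropped together.
    group_term = sum(
        math.sqrt(len(coefs)) * math.sqrt(sum(b * b for b in coefs))
        for coefs in by_group.values()
    )

    return lam * (alpha * l1_term + (1 - alpha) * group_term)


# Example with two hypothetical miRNA functional groups of two features each.
penalty = sparse_group_lasso_penalty(
    beta=[3.0, 4.0, 0.0, 0.0],
    groups=["g1", "g1", "g2", "g2"],
)
```

In this example the second group contributes nothing to either term, which is the behaviour that lets the method discard whole functional groups while still selecting individual features inside the groups it keeps.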
