Data Intensive Computing: A Biomedical Case Study in Gene Selection and Filtering

In data-intensive computing environments, where the number of samples and the data dimensionality grow very large, existing Bioinformatics methods are not effective for selecting important genes. In this chapter, we propose two approaches for parallel gene selection, both based on the well-known ReliefF feature selection method and designed for cluster computing environments. In the first design, denoted PReliefF_p, the input data are split into non-overlapping subsets, each assigned to a cluster node. Each node carries out gene selection by running ReliefF on its own subset, without interacting with the other nodes. The final gene ranking is generated by gathering the weight vectors from all nodes. In the second design, PReliefF_g, each node dynamically updates global weight vectors, so that the gene selection results from one node can be used to boost the selection process on the other nodes. Experimental results on real-world microarray expression data show that PReliefF_p and PReliefF_g achieve nearly linear speedup with respect to the number of computing nodes. When combined with several popular classification methods, classifiers built from the genes selected by either method perform the same as, or even better than, classifiers built from the genes selected by the original ReliefF method.
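The PReliefF_p design described above (split, score locally, gather) can be illustrated with a minimal sketch. This is not the authors' implementation: it runs sequentially rather than on a cluster, it substitutes a simplified two-class Relief (one nearest hit and one nearest miss per sample) for the full ReliefF, and it assumes that the gathered weight vectors are combined by averaging, which the abstract does not specify. The names `relief_weights` and `prelieff_p`, the round-robin split, and the toy data are all hypothetical.

```python
def relief_weights(X, y):
    """Simplified two-class Relief scores (a stand-in for ReliefF)."""
    n, d = len(X), len(X[0])
    # Per-gene value range, used to normalise differences to [0, 1].
    span = []
    for j in range(d):
        col = [row[j] for row in X]
        span.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    w = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        # Reward genes that differ on the nearest miss, penalise genes
        # that differ on the nearest hit.
        for j in range(d):
            w[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n
    return w


def prelieff_p(X, y, n_nodes):
    """Sequential sketch of the split / local scoring / gather pattern."""
    # Non-overlapping subsets; round-robin assignment is an assumption.
    parts = [range(r, len(X), n_nodes) for r in range(n_nodes)]
    # Each "node" scores genes on its own subset, with no interaction.
    local = [relief_weights([X[i] for i in p], [y[i] for i in p]) for p in parts]
    # Gather step; averaging the local vectors is an assumption.
    d = len(X[0])
    return [sum(v[j] for v in local) / len(local) for j in range(d)]


# Toy data: gene 0 separates the two classes, gene 1 is noise.
X = [[0.0, 0.1], [0.1, 0.9], [0.2, 0.5], [0.1, 0.3],
     [0.9, 0.2], [1.0, 0.8], [0.8, 0.4], [0.9, 0.6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
weights = prelieff_p(X, y, n_nodes=2)
ranking = sorted(range(len(weights)), key=lambda j: -weights[j])
```

In this sketch the informative gene ends up ranked ahead of the noise gene. Because each local scoring call touches only its own subset, the calls are independent and could in principle be dispatched to separate nodes, which is the property the first design relies on.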
