Informative transcription factor selection using support vector machine-based generalized approximate cross validation criteria

Genetic regulatory mechanisms play a pivotal role in many biological processes, ranging from development to survival. Identifying the common transcription factor binding sites (TFBSs) in a set of known co-regulated gene promoters, and identifying the genes regulated by transcription factors (TFs) that have important roles in a particular biological function, will advance our understanding of the interactions among co-regulated genes and of the intricate genetic regulatory mechanism underlying that function. To identify the common TFBSs in a set of known co-regulated gene promoters and to classify genes that are regulated by TFs, new approaches using Support Vector Machine (SVM)-based Generalized Approximate Cross Validation (GACV) criteria are proposed. Two variable selection procedures are considered: Recursive Feature Elimination (RFE) and Recursive Feature Addition (RFA). The performance of the proposed methods is compared with that of existing SVM-based criteria, Logistic Regression Analysis (LRA), Logic Regression (LR), and Decision Tree (DT) methods, using two real TF target gene data sets and simulated data. In terms of test error rates, the proposed methods perform better than the existing methods.
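The two search procedures named above can be illustrated with a minimal sketch. It does not implement the SVM-based GACV criterion. As a stand-in, the selection criterion here is the leave-one-out error of a nearest-centroid classifier, and the function names (`rfe`, `rfa`, `make_loo_error`) and the toy binding-site matrix are hypothetical, introduced only for illustration. Any criterion that maps a feature subset to a score to be minimized could be plugged in instead.

```python
from typing import Callable, List, Sequence, Tuple

Criterion = Callable[[Sequence[int]], float]
Path = List[Tuple[Tuple[int, ...], float]]


def rfe(n_features: int, criterion: Criterion, n_keep: int = 1) -> Path:
    """Backward search: repeatedly drop the feature whose removal
    gives the lowest criterion value."""
    active = list(range(n_features))
    path: Path = [(tuple(active), criterion(active))]
    while len(active) > n_keep:
        candidates = [[k for k in active if k != j] for j in active]
        active = min(candidates, key=criterion)  # first minimum wins ties
        path.append((tuple(active), criterion(active)))
    return path


def rfa(n_features: int, criterion: Criterion) -> Path:
    """Forward search: repeatedly add the feature whose inclusion
    gives the lowest criterion value."""
    active: List[int] = []
    path: Path = []
    while len(active) < n_features:
        remaining = [j for j in range(n_features) if j not in active]
        candidates = [sorted(active + [j]) for j in remaining]
        active = min(candidates, key=criterion)
        path.append((tuple(active), criterion(active)))
    return path


def make_loo_error(X: List[List[float]], y: List[int]) -> Criterion:
    """Stand-in criterion (NOT GACV): leave-one-out error rate of a
    nearest-centroid classifier restricted to the given feature subset."""

    def centroid(rows, subset):
        return [sum(r[j] for r in rows) / len(rows) for j in subset]

    def sq_dist(row, center, subset):
        return sum((row[j] - c) ** 2 for j, c in zip(subset, center))

    def criterion(subset: Sequence[int]) -> float:
        errors = 0
        for i, (row, label) in enumerate(zip(X, y)):
            train = [(r, t) for k, (r, t) in enumerate(zip(X, y)) if k != i]
            dists = {}
            for cls in set(y):
                rows = [r for r, t in train if t == cls]
                dists[cls] = sq_dist(row, centroid(rows, subset), subset)
            errors += min(dists, key=dists.get) != label
        return errors / len(X)

    return criterion


# Toy data (made up): rows are promoters, columns are presence (1) or
# absence (0) of a binding site for three hypothetical TFs. Only column 0
# separates the two classes; columns 1 and 2 are noise.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
     [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

criterion = make_loo_error(X, y)
rfe_path = rfe(3, criterion)
rfa_path = rfa(3, criterion)
print("RFE path:", rfe_path)
print("RFA path:", rfa_path)
```

On this toy data both searches single out feature 0: RFE ends with it as the last surviving feature, and RFA adds it first. Because each step only compares subsets that differ by one feature, both procedures are greedy and need not find the globally best subset.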
