An Integrative System for Prediction of NAC Proteins in Rice Using Different Feature Extraction Methods

The NAC gene family encodes a large family of plant-specific transcription factors with diverse roles in developmental processes and stress responses in plants. Genome-wide prediction tools for NAC proteins would have a significant impact on gene annotation in rice. In the present study, NACSVM, a tool for computational genome-scale prediction of NAC proteins in rice, was developed by integrating compositional and evolutionary information of NAC proteins. Initially, support vector machine (SVM) based modules were developed using a range of protein features, namely traditional amino acid composition, dipeptide (i+1) composition, tripeptide (i+2) composition, four-parts composition and position-specific scoring matrix (PSSM) profiles; these achieved overall accuracies of 79%, 93%, 93%, 79% and 100%, respectively. Later, two hybrid modules were developed based on amino acid, dipeptide and tripeptide composition, which achieved overall accuracies of 83% and 79%. NACSVM was also evaluated using Position-Specific Iterated BLAST (PSI-BLAST), which resulted in a lower accuracy of 50%. To benchmark NACSVM, the tool was evaluated using independent data tests and cross-validation. The statistical analyses carried out indicate that the proposed algorithm is a useful tool for annotating NAC proteins in the rice genome.
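To illustrate the compositional features named above, the following is a minimal sketch of amino acid composition (a 20-dimensional vector), dipeptide (i+1) composition (a 400-dimensional vector), and a hybrid vector formed by concatenating the two. It assumes the commonly used definitions, in which each feature is the fraction of residues or residue pairs in the sequence. The function names, the `gap` parameter, and the handling of non-standard residues are illustrative assumptions, not details taken from NACSVM.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so that feature
# positions are comparable across sequences.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def _standard_residues(seq):
    """Upper-case the sequence and drop non-standard symbols (assumption)."""
    return [aa for aa in seq.upper() if aa in AMINO_ACIDS]


def amino_acid_composition(seq):
    """Return the fraction of each of the 20 amino acids in seq."""
    residues = _standard_residues(seq)
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / n for aa in AMINO_ACIDS]


def dipeptide_composition(seq, gap=1):
    """Return the fraction of each ordered residue pair (i, i+gap).

    gap=1 counts adjacent residues, i.e. the (i+1) dipeptide composition.
    """
    residues = _standard_residues(seq)
    total = len(residues) - gap
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[residues[i] + residues[i + gap]] += 1
    return [counts[dp] / total for dp in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenate amino acid and dipeptide composition (420 values)."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

Vectors of this kind can be passed to any SVM implementation as fixed-length inputs, since every sequence maps to the same number of features regardless of its length.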
