Chemometrics for QSAR with low sequence homology: Mycobacterial promoter sequences recognition with 2D-RNA entropies

Abstract Predicting the mycobacterial promoter sequences involved in protein synthesis is important for studying the regulation of protein metabolism. This goal is, however, considered a challenging computational biology task because of the low homology between sequences. Consequently, a previous study based only on DNA sequence had to use a large set of input parameters and a multilayered feed-forward ANN architecture, trained with the error back-propagation algorithm, to reach an overall accuracy of up to 97% [Kalate et al., 2003. Comput. Biol. Chem. 27, 555–564]. One could therefore expect that a notably simpler model might be derived from parameters based on non-linear structural information. In the present work, a method based on molecular folding negentropies ($\Theta_k$) is introduced to predict, for the first time, mycobacterial promoter sequences (mps) from the corresponding RNA secondary structure. The best QSAR equation found was the classification function $\mathrm{mps} = 4.921 \times {}^{0}\Theta_{M} - 1.205$, which recognised 126/135 mps (93.3%) and 100% of 245 control sequences (cs). The model showed a very high Matthews correlation coefficient, C = 0.949. Both the average overall accuracy and the predictability were 97.6%. Additionally, several machine learning algorithms were applied to confirm the validity of the LDA model from a chemometrics point of view. This linear model, with only one parameter (${}^{0}\Theta_{M}$), may be considered by far the simplest reported to date, with no loss of accuracy (97%) with respect to Kalate et al.'s model.
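As a rough illustration of the figures quoted in the abstract, the sketch below evaluates the one-parameter classification function and recomputes the summary statistics from the reported counts (126 of 135 promoters and all 245 controls recognised). It is not the authors' code. The function names, the zero decision threshold, and the convention that a positive score means "promoter" are assumptions made for this example. The negentropy value passed in is taken as already computed, since the abstract does not say how it is obtained from the RNA secondary structure.

```python
from math import sqrt


def mps_score(theta0_m: float) -> float:
    """Classification function quoted in the abstract: 4.921 * theta - 1.205."""
    return 4.921 * theta0_m - 1.205


def is_promoter(theta0_m: float, threshold: float = 0.0) -> bool:
    # ASSUMPTION: a score above zero is read as "promoter"; the abstract
    # does not state the cut-off or the sign convention.
    return mps_score(theta0_m) > threshold


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true promoters that were recognised."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of control sequences that were recognised as controls."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all sequences classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts derived from the abstract: 126/135 promoters, 245/245 controls.
TP, FN = 126, 135 - 126
TN, FP = 245, 0

if __name__ == "__main__":
    print(f"sensitivity = {sensitivity(TP, FN):.3f}")
    print(f"specificity = {specificity(TN, FP):.3f}")
    print(f"accuracy    = {accuracy(TP, TN, FP, FN):.3f}")
    print(f"Matthews C  = {matthews(TP, TN, FP, FN):.3f}")
```

With these counts the overall accuracy is 371/380, about 97.6%, and the Matthews coefficient rounds to 0.949, which agrees with the values reported in the abstract.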

[1] Martin E. Mulligan, et al. Analysis of the occurrence of promoter-sites in DNA, 1986, Nucleic Acids Res.

[2] Roberto Todeschini, et al. Handbook of Molecular Descriptors, 2002.

[3] W R Strohl, et al. Compilation and analysis of DNA sequences associated with apparent streptomycete promoters, 1992, Nucleic acids research.

[4] Maykel Pérez González, et al. A topological sub-structural approach to the mutagenic activity in dental monomers. 2. Cycloaliphatic epoxides, 2004.

[5] J A Swets, et al. Measuring the accuracy of diagnostic systems, 1988, Science.

[6] Kathleen Marchal, et al. Computational Approaches to Identify Promoters and cis-Regulatory Elements in Plant Genomes, 2003, Plant Physiology.

[7] Robert Entriken, et al. Escherichia coli promoter sequences predict in vitro RNA polymerase selectivity, 1984, Nucleic Acids Res.

[8] P. Herdewijn, et al. A neural network for predicting the stability of RNA/DNA hybrid duplexes, 2004.

[9] Zheng Yuan. Prediction of protein subcellular locations using Markov chain models, 1999, FEBS letters.

[10] Humberto González Díaz, et al. Stochastic molecular descriptors for polymers. 1. Modelling the properties of icosahedral viruses with 3D-Markovian negentropies, 2004.

[11] Maykel Pérez González, et al. A topological sub-structural approach of the mutagenic activity in dental monomers. 1. Aromatic epoxides, 2004.

[12] Zhirong Sun, et al. Support vector machine approach for protein subcellular localization prediction, 2001, Bioinform.

[13] M. Bashyam, et al. A study of mycobacterial transcriptional apparatus: identification of novel features in promoter elements, 1996, Journal of bacteriology.

[14] Sanjeev S. Tambe, et al. Artificial neural networks for prediction of mycobacterial promoter sequences, 2003, Comput. Biol. Chem.

[15] L.M.C. Buydens, et al. Circular effects in representations of an RNA nucleotides data set in relation with principal components analysis, 2001.

[16] R. B. Alzina, et al. Introducción conceptual al análisis multivariable: un enfoque informático con los paquetes SPSS-X, BMDP, LISREL y SPAD, 1989.

[17] Miguel A. Cabrera, et al. TOPS-MODE approach for the prediction of blood-brain barrier permeation, 2004, Journal of pharmaceutical sciences.

[18] R M Harshey, et al. Rate of ribonucleic acid chain growth in Mycobacterium tuberculosis H37Rv, 1977, Journal of bacteriology.

[19] Eugenio Uriarte, et al. Markovian Backbone Negentropies: Molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants, 2004, Proteins.

[20] C. Locht, et al. Analysis of the Mycobacterium tuberculosis 85A antigen promoter region, 1995, Journal of bacteriology.

[21] Humberto González-Díaz, et al. Markov entropy backbone electrostatic descriptors for predicting proteins biological activity, 2004, Bioorganic & medicinal chemistry letters.

[22] J. Francis. Statistica for Windows, 1995.

[23] Kathleen Marchal, et al. PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences, 2002, Nucleic Acids Res.

[24] Paul Schliekelman, et al. Statistical Methods in Bioinformatics: An Introduction, 2001.

[25] Humberto González Díaz, et al. Markovian negentropies in bioinformatics. 1. A picture of footprints after the interaction of the HIV-1 -RNA packaging region with drugs, 2003, Bioinform.

[26] M. A. Cabrera Pérez, et al. In silico prediction of central nervous system activity of compounds. Identification of potential pharmacophores by the TOPS-MODE approach, 2004, Bioorganic & medicinal chemistry.

[27] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques with Java implementations, 2002, SGMD.

[28] B. Gopal, et al. The mechanism of upstream activation in the rrnB operon of Mycobacterium smegmatis is different from the Escherichia coli paradigm, 2005, Microbiology.

[29] L B Kier, et al. Use of molecular negentropy to encode structure governing biological activity, 1980, Journal of pharmaceutical sciences.

[30] M. O'Neill, et al. Escherichia coli promoters. II. A spacing class-dependent promoter search protocol, 1989, The Journal of biological chemistry.

[31] Rupali N. Kalate, et al. Analysis of DNA curvature distribution in mycobacterial promoters using theoretical models, 2002, Biophysical chemistry.

[32] Humberto González-Díaz, et al. Proteins Markovian 3D-QSAR with spherically-truncated average electrostatic potentials, 2005, Bioorganic & medicinal chemistry.

[33] Maykel Pérez González, et al. TOPS-MODE approach to predict mutagenicity in dental monomers, 2004.

[34] Rafael Molina, et al. Stochastic molecular descriptors for polymers. 2. Spherical truncation of electrostatic interactions, 2005.

[35] Andreas D. Baxevanis, et al. Bioinformatics - a practical guide to the analysis of genes and proteins, 2001, Methods of biochemical analysis.

[36] Han van de Waterbeemd, et al. Chemometric methods in molecular design, 1995.