Exploring the Adenylation Domain Repertoire of Nonribosomal Peptide Synthetases Using an Ensemble of Sequence-Search Methods

The introduction of two-dimensional (2D) graphs and their numerical characterization for comparative analyses of DNA/RNA and protein sequences, without the need for sequence alignments, is a recent yet active research topic in bioinformatics. Here, we used a 2D artificial representation (four-color maps) with a simple numerical characterization through topological indices (TIs) to aid the discovery of remote homologs of adenylation domains (A-domains) from the nonribosomal peptide synthetase (NRPS) class in the proteome of the cyanobacterium Microcystis aeruginosa. Cyanobacteria are a rich source of structurally diverse oligopeptides that are predominantly synthesized by NRPS. Several A-domains share amino acid identities below 20%, making them a possible source of remote homologs; consequently, A-domains cannot be easily retrieved by BLASTp searches using a single template. To cope with the sequence diversity of the A-domains, we combined homology-search methods with an alignment-free tool that uses protein four-color maps. TI2BioP (Topological Indices to BioPolymers) version 2.0, available at http://ti2biop.sourceforge.net/, allowed the calculation of simple TIs from the four-color maps of the protein sequences. These TIs were used as input predictors for the statistical estimations required to build the alignment-free models. We concluded that using graphical/numerical approaches in cooperation with other sequence-search methods, such as multi-template BLASTp and profile HMMs, can give the most complete exploration of the repertoire of highly diverse protein families.
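The ensemble idea described above, pooling the hits of several search methods so that each can recover members the others miss, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `combine_hits`, the method labels, and the protein identifiers are all hypothetical, and the hit sets are assumed to have been produced beforehand by the individual tools.

```python
from collections import defaultdict


def combine_hits(hits_by_method):
    """Union the hits of several search methods.

    hits_by_method maps a method name to the set of protein IDs it reported.
    Returns a dict mapping each protein ID to the sorted list of methods
    that reported it.
    """
    support = defaultdict(set)
    for method, hits in hits_by_method.items():
        for protein_id in hits:
            support[protein_id].add(method)
    return {pid: sorted(methods) for pid, methods in support.items()}


# Hypothetical identifiers and method labels, for illustration only.
example = {
    "multi_template_blastp": {"prot_A", "prot_B"},
    "profile_hmm": {"prot_B", "prot_C"},
    "alignment_free_model": {"prot_C", "prot_D"},
}

combined = combine_hits(example)

# Candidates reported by the alignment-free model alone; in a setting like
# the one described, these would be the ones to inspect as possible remote
# homologs.
only_alignment_free = [
    pid for pid, methods in combined.items()
    if methods == ["alignment_free_model"]
]
```

Recording which methods support each hit, rather than taking a bare union, makes it possible to separate candidates confirmed by several methods from those found by one method only.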
