Alignment-free prediction of polygalacturonases with pseudofolding topological indices: experimental isolation from Coffea arabica and prediction of a new sequence.

Polygalacturonases (PGs) have attracted the attention of microbiologists and of the biotechnology and pharmaceutical industries because these enzymes are relevant to phytopathogen invasion and fruit ripening and are potential antimicrobial drug targets. Numeric Topological Indices (TIs) of protein pseudofolding lattices can be used as inputs for classification algorithms in Quantitative Structure-Activity Relationship (QSAR) studies. However, a comparative study of different QSAR models for PGs has not been reported. In this study, we calculated for the first time two classes of TIs, spectral moments (πk) and entropy values (θk), for the Markov matrices associated with the pseudofolding lattices of 108 PGs and 100 heterogeneous non-PG proteins. We then developed linear classifiers based on Linear Discriminant Analysis (LDA) and nonlinear classifiers based on four types of Artificial Neural Networks (ANNs). The πk-LDA model correctly classified 98.8% of the PGs and 100% of the non-PGs used to train the model, as well as 98.1% of all sequences in the external validation series. The πk-LDA model was the best model found in terms of accuracy and/or simplicity. In addition, we report for the first time the experimental isolation and successful prediction of a new PG sequence from Coffea arabica. This sequence was deposited in GenBank by our group with accession number GDQ336394. Models of this type are a useful alignment-free complement to alignment-based procedures.
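
As an illustration of the two classes of indices named above, the following minimal Python sketch computes spectral moments and entropy values from a Markov matrix. The abstract does not give the definitions, so the ones used here are assumptions: the spectral moment of order k is taken as the trace of the k-th power of the Markov matrix, and the entropy of order k as the Shannon entropy of the probability distribution after k steps from a chosen starting distribution. The toy weight matrix, the uniform starting distribution, and all function names are hypothetical and do not reproduce the authors' lattice construction or software.

```python
import math


def row_normalize(weights):
    """Turn a non-negative square weight matrix into a row-stochastic (Markov) matrix."""
    markov = []
    for row in weights:
        total = sum(row)
        if total <= 0:
            raise ValueError("each row needs a positive total weight")
        markov.append([w / total for w in row])
    return markov


def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(markov, k_max):
    """Assumed definition: pi_k = Tr(M^k) for k = 0..k_max."""
    n = len(markov)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # M^0
    moments = []
    for _ in range(k_max + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = matmul(power, markov)
    return moments


def entropies(markov, start, k_max):
    """Assumed definition: theta_k = Shannon entropy of start * M^k, k = 0..k_max."""
    n = len(markov)
    dist = list(start)
    values = []
    for _ in range(k_max + 1):
        values.append(-sum(p * math.log(p) for p in dist if p > 0))
        # Advance the distribution by one Markov step.
        dist = [sum(dist[i] * markov[i][j] for i in range(n)) for j in range(n)]
    return values


# Hypothetical 3-node lattice: a path graph with self-loops (not real protein data).
toy_weights = [[1, 1, 0],
               [1, 1, 1],
               [0, 1, 1]]
toy_markov = row_normalize(toy_weights)
pi_k = spectral_moments(toy_markov, k_max=5)
theta_k = entropies(toy_markov, start=[1 / 3, 1 / 3, 1 / 3], k_max=5)
```

Under these assumptions, each protein would be reduced to a fixed-length numeric vector such as `pi_k` or `theta_k`, which could then serve as the input to a classifier such as LDA or an ANN.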