A comparison of classifiers for predicting the class color of fluorescent proteins

Fluorescent proteins have been applied in a wide variety of fields, ranging from basic science to industrial applications. Apart from the naturally occurring fluorescent proteins, there is growing interest in genetically modified variants that emit light at a specific wavelength. Genetically modifying a protein is not an easy task, especially because exchanging one residue for another has to achieve the desired property while maintaining protein stability. To help choose which residues to exchange, computational methods are applied to predict the function and stability of proteins. In this work we prepared a dataset composed of 109 fluorescent proteins and tested four classical supervised classification algorithms: artificial neural networks (ANNs), decision trees (DTs), support vector machines (SVMs) and random forests (RFs). This is the first time that these algorithms have been compared on this task. A comparison of the algorithms' performance shows that DTs, SVMs and RFs were significantly better than ANNs, and that RF was the best method in all scenarios. However, the interpretability of DTs is highly relevant and can provide important clues about the mechanisms involved in protein color emission. The results are promising and indicate that the use of in silico methods can greatly reduce the time and cost of in vitro experiments.
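As an illustration of the kind of comparison described above, the following is a minimal sketch that scores the four classifier families with stratified cross-validation. It assumes scikit-learn and uses synthetic data: the feature count, the number of color classes, the hyperparameters and the 5-fold protocol are all placeholder assumptions, and only the sample count of 109 comes from the text. It is not the authors' dataset or pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_toy_dataset(n_samples=109, n_features=20, n_classes=4, seed=0):
    """Synthetic stand-in for a protein dataset (assumption: 20 numeric
    features and 4 color classes; only n_samples=109 comes from the text)."""
    return make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=6,
        n_classes=n_classes,
        random_state=seed,
    )


def compare_classifiers(X, y, n_splits=5, seed=0):
    """Return the mean cross-validated accuracy for each classifier family.

    Hyperparameters are library defaults chosen for illustration only.
    """
    models = {
        # Scaling is applied before the ANN and the SVM, which are
        # sensitive to feature ranges; the tree-based models do not need it.
        "ANN": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=2000, random_state=seed)
        ),
        "DT": DecisionTreeClassifier(random_state=seed),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=cv).mean())
        for name, model in models.items()
    }


if __name__ == "__main__":
    X, y = make_toy_dataset()
    for name, acc in compare_classifiers(X, y).items():
        print(f"{name}: mean accuracy = {acc:.3f}")
```

On real data, the synthetic generator would be replaced by the encoded protein features and their color labels. Accuracies obtained on the toy data say nothing about the ranking reported above.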
