Hierarchical Classification of Enzyme Promiscuity Using Positive, Unlabeled, and Hard Negative Examples

Despite significant progress in sequencing technology, many cellular enzymatic activities remain unknown. We develop a new method, referred to as SUNDRY (Similarity-weighting for UNlabeled Data in a Residual HierarchY), for training enzyme-specific predictors that take a query substrate molecule as input and predict whether the enzyme would act on that substrate. A major challenge in this enzyme promiscuity prediction problem is the scarcity of labeled data, especially for negative cases (enzyme-substrate pairs where the enzyme does not transform the substrate into a product molecule). To overcome this issue, our method learns to classify substrates for a target enzyme by sharing information from related enzymes via known tree hierarchies. It also incorporates three types of data: molecules known to be transformed by the enzyme (positive cases), molecules whose relationship to the enzyme is unknown (unlabeled cases), and molecules labeled as inhibitors of the enzyme. We refer to inhibitors as hard negative cases because they may be difficult to classify well: like positive cases, they bind to the enzyme, but they are not transformed by it. Our method uses confidence scores derived from structural similarity to treat unlabeled examples as weighted negatives. We compare our hierarchy-aware predictor against a baseline that cannot share information across related enzymes. Using data from the BRENDA database, we show that each of our three contributions (hierarchical sharing, per-example confidence weighting of unlabeled data based on molecular similarity, and the inclusion of inhibitors as hard negative examples) improves the characterization of enzyme promiscuity.
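To make the idea of treating unlabeled examples as weighted negatives concrete, the following is a minimal sketch and not the SUNDRY implementation. It assumes that molecules are represented as sets of "on" fingerprint bits, that structural similarity is measured with the Tanimoto coefficient, and that the confidence weight of an unlabeled molecule is one minus its highest similarity to any known positive. The function names and the weighting formula are illustrative assumptions; the abstract does not specify them.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical representation: a molecule is the set of "on" bit indices
# of some structural fingerprint.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit-set fingerprints."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def negative_confidence(unlabeled: Fingerprint,
                        positives: List[Fingerprint]) -> float:
    """Confidence that an unlabeled molecule is a true negative.

    Assumption (not taken from the source): the more an unlabeled molecule
    resembles a known substrate, the less we trust it as a negative.
    """
    if not positives:
        return 1.0
    return 1.0 - max(tanimoto(unlabeled, p) for p in positives)


def build_training_set(
    positives: List[Fingerprint],
    unlabeled: List[Fingerprint],
    inhibitors: List[Fingerprint],
) -> List[Tuple[Fingerprint, int, float]]:
    """Return (fingerprint, label, weight) triples for one enzyme.

    Positives and inhibitors (hard negatives) get full weight; unlabeled
    molecules are labeled 0 and down-weighted by their confidence score.
    """
    rows = [(fp, 1, 1.0) for fp in positives]
    rows += [(fp, 0, 1.0) for fp in inhibitors]
    rows += [(fp, 0, negative_confidence(fp, positives)) for fp in unlabeled]
    return rows
```

The resulting triples could be passed to any learner that accepts per-example weights, for instance through the `sample_weight` argument that many scikit-learn estimators accept in `fit`. Sharing information across related enzymes in the hierarchy is a separate component and is not shown here.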
