PREFMoDeL: A Systematic Review and Proposed Taxonomy of Biomolecular Features for Deep Learning

Understanding a molecule’s biological properties (its structure, functions, and activities) is of fundamental importance in biochemical and biomedical research. To this end, computational methods in Artificial Intelligence, in particular Deep Learning (DL), have been applied to advance biomolecular understanding, from the analysis and prediction of protein–protein and protein–ligand interactions to drug discovery and design. While choosing the most appropriate DL architecture is vital for accurately modeling the task at hand, equally important is the choice of input features used to represent molecular properties in these DL models. Through hypothesis testing, bioinformaticians have created thousands of engineered features for biomolecules such as proteins and their ligands. Herein we present an organizational taxonomy for biomolecular features extracted from 808 articles from across the scientific literature. This objective view of biomolecular features can reduce various forms of experimental and/or investigator bias and can additionally facilitate feature selection in biomolecular analysis and design tasks. The resulting dataset contains 1360 nondeduplicated features; a sample of these features was classified by their properties, clustered, and used to suggest new features. The complete feature dataset (the Public Repository of Engineered Features for Molecular Deep Learning, PREFMoDeL) is released for collaborative sourcing on the web.
