When analyzing chemical reactions, it is essential to know which molecules actively take part in the reaction and which educts form the product molecules. Assigning reaction roles such as reactant, reagent, or product to the molecules of a chemical reaction may be trivial for hand-curated reaction schemes, but it is more difficult to automate, and automation is essential when handling large amounts of reaction data. Here, we describe a new fingerprint-based, data-driven approach to reaction role assignment that is also applicable to unbalanced and noisy reaction schemes. Given the set of molecules involved and the known product(s) of a reaction, we assign the most probable reactants and classify the remaining molecules as reagents. Our approach was validated using two data sets. The first is a hand-curated set of about 680 diverse reactions extracted from patents, which span more than 200 different reaction types and include up to 18 different reactants. The second consists of 50 000 randomly picked reactions from US patents. The results for the second data set were compared to results obtained using two different atom-to-atom mapping algorithms. For both data sets, our method assigns the reaction roles correctly for the vast majority of reactions, achieving accuracies of 88% and 97%, respectively. The median time needed, about 8 ms, indicates that the algorithm is fast enough to be applied to large collections. The new method is available as part of the RDKit toolkit, and the data sets and Jupyter notebooks used to evaluate it are available in the Supporting Information of this publication.
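To make the idea of selecting the most probable reactants for a known product concrete, the following is a minimal toy sketch, not the published method and not the RDKit implementation. It assumes that each molecule is described by a simple bag of feature counts (heavy-atom counts stand in for real chemical fingerprints), that the best reactant set is found by exhaustive search over small subsets, and that a set is scored by how many product features it covers minus a penalty for surplus features. The function names `score` and `assign_roles`, the penalty weight, and the `max_reactants` limit are all hypothetical.

```python
from collections import Counter
from itertools import combinations

SURPLUS_PENALTY = 0.5  # hypothetical weight, chosen only for illustration


def score(candidate_fps, product_fp):
    """Product features covered by the candidates minus penalized surplus features."""
    combined = Counter()
    for fp in candidate_fps:
        combined.update(fp)
    covered = sum((combined & product_fp).values())   # features shared with the product
    surplus = sum((combined - product_fp).values())   # features the product does not need
    return covered - SURPLUS_PENALTY * surplus


def assign_roles(molecules, product_fp, max_reactants=3):
    """Return (reactants, reagents) by exhaustive search over small candidate subsets."""
    product_fp = Counter(product_fp)
    fps = {name: Counter(fp) for name, fp in molecules.items()}
    names = list(fps)
    best, best_score = (), float("-inf")
    for k in range(1, min(max_reactants, len(names)) + 1):
        for combo in combinations(names, k):
            s = score([fps[n] for n in combo], product_fp)
            if s > best_score:
                best, best_score = combo, s
    reactants = list(best)
    reagents = [n for n in names if n not in best]
    return reactants, reagents


# Toy amide formation: heavy-atom counts stand in for fingerprints (an assumption).
molecules = {
    "acetic acid": {"C": 2, "O": 2},
    "methylamine": {"C": 1, "N": 1},
    "triethylamine": {"C": 6, "N": 1},
    "dichloromethane": {"C": 1, "Cl": 2},
}
product = {"C": 3, "N": 1, "O": 1}  # N-methylacetamide

reactants, reagents = assign_roles(molecules, product)
print("reactants:", reactants)
print("reagents:", reagents)
```

In this toy case the acid and the amine together cover the product's features with the least surplus, so the base and the solvent are left as reagents. Exhaustive search is only practical for a handful of candidates; it is used here for clarity rather than speed.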