DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature

Decades of scientific articles describe chemical compounds, their structures and their properties. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing step. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages, so it is also applicable to older articles published before vector images were introduced in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. To facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.
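
The two-stage idea described above (a detected mask that may be incomplete, followed by an expansion step) can be illustrated with a minimal sketch. This is not the DECIMER Segmentation implementation: it assumes the page has already been binarized into a grid of ink (1) and background (0) pixels, and the function names `expand_mask` and `bounding_box` are hypothetical. The sketch grows a mask until it covers every ink pixel connected to ink already inside the mask, then reports the bounding box that could be cropped as a segment.

```python
from collections import deque


def expand_mask(ink, mask):
    """Grow `mask` over every ink pixel that is 8-connected to ink already inside it.

    `ink` and `mask` are equally sized 2D lists; truthy means ink / masked.
    Illustrative assumption only, not the published post-processing procedure.
    """
    rows, cols = len(ink), len(ink[0])
    expanded = [[bool(v) for v in row] for row in mask]
    # Seed the flood fill with ink pixels that the detector already covered.
    queue = deque(
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if mask[r][c] and ink[r][c]
    )
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and ink[nr][nc] and not expanded[nr][nc]:
                    expanded[nr][nc] = True
                    queue.append((nr, nc))
    return expanded


def bounding_box(mask):
    """Return (top, left, bottom, right) of the masked area, or None if empty."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    rs = [r for r, _ in coords]
    cs = [c for _, c in coords]
    return (min(rs), min(cs), max(rs), max(cs))


# Toy page: one connected drawing on the left, an unrelated stroke on the right.
ink = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
]
# Incomplete mask: the detector covered only the first two pixels of the drawing.
mask = [[False] * 7 for _ in range(4)]
mask[1][1] = mask[1][2] = True

expanded = expand_mask(ink, mask)
print(bounding_box(mask), bounding_box(expanded))
```

In this toy case the expanded mask reaches the rest of the connected drawing while leaving the separate stroke on the right untouched. A real page would need additional handling, for example for structures drawn as several disconnected parts.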
