EXSCLAIM! - An automated pipeline for the construction of labeled materials imaging datasets from literature

Due to recent improvements in image resolution and acquisition speed, materials microscopy is experiencing an explosion of published imaging data. The standard publication format is sufficient for traditional data ingestion scenarios, where a select number of images can be critically examined and curated manually, but it is not conducive to large-scale data aggregation or analysis, which hinders data sharing and reuse. Most images in publications are presented as components of a larger figure, with their explicit context buried in the main body or caption text, so even when aggregated, collections of images with weak or no digitized contextual labels have limited value. To solve the problem of curating labeled microscopy data from literature, this work introduces the EXSCLAIM! Python toolkit for the automatic EXtraction, Separation, and Caption-based natural Language Annotation of IMages from scientific literature. We highlight the methodology behind the construction of EXSCLAIM! and demonstrate its ability to extract and label open-source scientific images at high volume.
