Utilizing image-based features in biomedical document classification

Images are a rich source of information that remains underutilized in biomedical document classification. We present work that uses both image- and text-based features to identify articles of interest, in this case articles pertaining to cis-regulatory modules in the context of gene networks. Extending our recently introduced idea of using OCR-based features to identify DNA content in images, we combine image- and text-based classifiers to categorize documents as relevant or irrelevant to cis-regulatory modules. Using a set of hundreds of articles, marked by experts as relevant or irrelevant to cis-regulatory modules, we train and test image-based classifiers, text-based classifiers, and classifiers that integrate both. Our results indicate that the integrated classifiers perform best, with recall, F-measure, and utility all above 0.9, demonstrating the significance of incorporating image data, and specifically OCR-based features, into the document categorization process. Moreover, the use of character distribution properties to represent images is directly relevant to other biomedical images containing text (e.g., RNA, proteins). Diagrams and other images containing text are also prevalent outside the biomedical domain, so the approach stands to be applicable and beneficial in other application areas.
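To make the notion of character distribution properties more concrete, the following is a minimal sketch of how text recovered from an image by OCR might be turned into numeric features. It is an illustration only, not the feature set used in this work: the function name `char_distribution_features`, the feature keys, and the restriction to the four DNA letters are all assumptions, and the OCR step itself is assumed to have already produced a plain string.

```python
from collections import Counter

# Assumption: DNA content is approximated by the four nucleotide letters.
DNA_LETTERS = "ACGT"


def char_distribution_features(ocr_text: str) -> dict:
    """Return hypothetical character-distribution features for OCR output.

    Each feature is the relative frequency of one DNA letter among the
    alphanumeric characters of the text; `dna_fraction` is their sum.
    """
    # Ignore whitespace and punctuation, and fold case.
    chars = [c.upper() for c in ocr_text if c.isalnum()]
    total = len(chars)
    counts = Counter(chars)

    features = {
        f"freq_{base}": (counts[base] / total if total else 0.0)
        for base in DNA_LETTERS
    }
    features["dna_fraction"] = sum(features[f"freq_{b}"] for b in DNA_LETTERS)
    return features


if __name__ == "__main__":
    # Text from a sequence figure versus text from an ordinary caption.
    print(char_distribution_features("ACGT acgt"))
    print(char_distribution_features("Figure 1"))
```

Under these assumptions, an image showing a DNA sequence yields a `dna_fraction` close to 1, while ordinary labels or captions yield a much lower value; such a vector could then be passed to any standard classifier alongside text-based features.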
