A concept-driven biomedical knowledge extraction and visualization framework for conceptualization of text corpora

A number of techniques, such as information extraction, document classification, document clustering and information visualization, have been developed to ease the extraction and understanding of information embedded in text documents. However, knowledge embedded in natural language texts is difficult to extract using simple pattern-matching techniques, and most of these methods do not directly help users understand the key concepts in a document corpus and the semantic relationships among them, which are critical for capturing its conceptual structure. The problem arises because most of the information is embedded in unstructured or semi-structured texts that computers cannot easily interpret. In this paper, we present a novel Biomedical Knowledge Extraction and Visualization framework, BioKEVis, to identify key information components from biomedical text documents. The information components are centered on key concepts, which BioKEVis identifies by applying linguistic analysis and Latent Semantic Analysis (LSA). Information component extraction is based on natural language processing techniques and semantic analysis. The system is also integrated with a biomedical named entity recognizer, ABNER, to tag genes, proteins and other entity names in the text. We also present a method for collating information extracted from multiple sources to generate a semantic network. The network provides distinct user perspectives, allows navigation over documents with similar information components, and gives a comprehensive view of the collection. The system stores the extracted information components in a structured repository, which is integrated with a query-processing module to handle biomedical queries over text documents. We also propose a document ranking mechanism that presents retrieved documents in order of their relevance to the user query.
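The collation step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each information component is a (concept, relation, concept) triple tagged with the document it came from; the abstract does not specify this representation. The triples, the document identifiers and the function names build_network and related_documents are all hypothetical and are not part of BioKEVis.

```python
from collections import defaultdict

# Hypothetical information components: (concept, relation, concept, doc_id).
# Toy data for illustration only, not output of BioKEVis.
components = [
    ("gene A", "activates", "protein B", "doc1"),
    ("gene A", "activates", "protein B", "doc2"),
    ("protein B", "inhibits", "protein C", "doc2"),
    ("protein C", "binds", "gene D", "doc3"),
]


def build_network(components):
    """Merge identical triples from different documents into labeled edges.

    Each edge maps to the set of documents in which it was found.
    """
    edges = defaultdict(set)
    for left, relation, right, doc_id in components:
        edges[(left, relation, right)].add(doc_id)
    return dict(edges)


def related_documents(edges, doc_id):
    """Return the documents that share at least one edge with doc_id."""
    related = set()
    for docs in edges.values():
        if doc_id in docs:
            related |= docs
    related.discard(doc_id)
    return related


network = build_network(components)
print(sorted(related_documents(network, "doc1")))  # prints ['doc2']
```

Keeping the set of source documents on every edge is one simple way to support navigation between documents with similar information components; the actual system may organize its repository differently.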
