Hybrid ontology-learning materials engineering system for pharmaceutical products: Multi-label entity recognition and concept detection

Abstract The information explosion has opened a new era in knowledge management, one in which established ways of modeling knowledge and making decisions are becoming inadequate. In the search for new modeling paradigms, we expect ontologies to play a major role. A critical challenge is the scarcity of semantically rich, properly populated ontologies in most application domains of chemical and materials engineering. Developing such ontologies is demanding and requires considerable investment of time, effort, and expert knowledge. Automation tools are needed that can help an ontology engineer develop and curate domain-specific ontologies quickly. In this paper we present a conceptual framework, a general approach for populating scientific ontologies, and its implementation as the prototype HOLMES, as an early attempt towards such an automated knowledge management environment. Our approach integrates a variety of machine learning and natural language processing methods to extract information from journal articles and store it semantically in an ontology. In this work, we present the identification of key terms (such as chemicals, drugs, processes, and anatomical entities) from abstracts and the classification of these terms into 25 classes. Two methods, a multi-class classifier (SVM) and a multi-label classifier (HOMER), were tested on an annotated data set for the pharmaceutical industry. The tests used two versions of the same data set, one annotated with the BIO notation and the other without it. HOMER achieved a better F1 score with the BIO notation (63.6% vs 48.5%), while SVM performed slightly better on the non-BIO version (54.1% vs 53.2%). However, the standard metrics do not account for the multiple answers that a multi-label classifier is allowed to give. As our computational experiments show, the performance of the multi-label classifier is encouraging, but much more remains to be done to develop a practically viable automated ontology-based knowledge management system.
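The difference between the two data set versions can be illustrated with a minimal sketch. It assumes the usual BIO convention (B- marks the first token of an entity, I- a continuation, O a token outside any entity); the helper names `to_bio` and `to_plain`, the example sentence, and the class labels `Drug` and `Process` are hypothetical and are not taken from the annotated data set or from HOLMES.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, class_label).
Span = Tuple[int, int, str]


def to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Tag tokens with the BIO scheme; spans are assumed not to overlap."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


def to_plain(tokens: List[str], spans: List[Span]) -> List[str]:
    """Tag tokens with the bare class label (the non-BIO variant)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            tags[i] = label
    return tags


# Hypothetical example sentence and annotations.
tokens = ["Acetylsalicylic", "acid", "inhibits", "platelet", "aggregation"]
spans = [(0, 2, "Drug"), (3, 5, "Process")]

bio_tags = to_bio(tokens, spans)
plain_tags = to_plain(tokens, spans)

for token, bio, plain in zip(tokens, bio_tags, plain_tags):
    print(f"{token:16s} {bio:10s} {plain}")
```

Under this convention each class is split into a B- and an I- variant, so a BIO-tagged version presents the classifier with roughly twice as many target labels as the plain version, in exchange for explicit entity boundaries.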
