Embedding Open-domain Common-sense Knowledge from Text

Our ability to understand language often relies on common-sense knowledge - background information the speaker can assume is known by the reader. Similarly, our comprehension of the language used in complex domains relies on access to domain-specific knowledge. Capturing common-sense and domain-specific knowledge can be achieved by taking advantage of recent advances in open information extraction (IE) techniques and, more importantly, of knowledge embeddings, which are multi-dimensional representations of concepts and relations. Building a knowledge graph for representing common-sense knowledge in which concepts discerned from noun phrases are cast as vertices and lexicalized relations are cast as edges leads to learning the embeddings of common-sense knowledge accounting for semantic compositionality as well as implied knowledge. Common-sense knowledge is acquired from a vast collection of blogs and books as well as from WordNet. Similarly, medical knowledge is learned from two large sets of electronic health records. The evaluation results of these two forms of knowledge are promising: the same knowledge acquisition methodology based on learning knowledge embeddings works well both for common-sense knowledge and for medical knowledge Interestingly, the common-sense knowledge that we have acquired was evaluated as being less neutral than than the medical knowledge, as it often reflected the opinion of the knowledge utterer. In addition, the acquired medical knowledge was evaluated as more plausible than the common-sense knowledge, reflecting the complexity of acquiring common-sense knowledge due to the pragmatics and economicity of language.

[1]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[2]  H. Lowe,et al.  Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. , 1994, JAMA.

[3]  I. Obeid,et al.  The TUH EEG CORPUS: A big data resource for automated EEG interpretation , 2014, 2014 IEEE Signal Processing in Medicine and Biology Symposium (SPMB).

[4]  R. Schank,et al.  Knowledge and Memory: The Real Story , 1995 .

[5]  Akshay Java,et al.  The ICWSM 2009 Spinn3r Dataset , 2009 .

[6]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[7]  Roger C. Schank,et al.  Dynamic memory - a theory of reminding and learning in computers and people , 1983 .

[8]  Jason Weston,et al.  Translating Embeddings for Modeling Multi-relational Data , 2013, NIPS.

[9]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[10]  M. Saeed Multiparameter Intelligent Monitoring in Intensive Care II ( MIMIC-II ) : A public-access intensive care unit database , 2011 .

[11]  J J Cimino,et al.  From ICD9-CM to MeSH using the UMLS: a how-to guide. , 1993, Proceedings. Symposium on Computer Applications in Medical Care.

[12]  Peter D. Turney Domain and Function: A Dual-Space Model of Semantic Relations and Compositions , 2012, J. Artif. Intell. Res..

[13]  Sanda M. Harabagiu,et al.  Automatic Generation of a Qualified Medical Knowledge Graph and Its Usage for Retrieving Patient Cohorts from Electronic Medical Records , 2013, 2013 IEEE Seventh International Conference on Semantic Computing.

[14]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[15]  Li Guo,et al.  Semantically Smooth Knowledge Graph Embedding , 2015, ACL.

[16]  Catherine Havasi,et al.  ConceptNet 5: A Large Semantic Network for Relational Knowledge , 2013, The People's Web Meets NLP.

[17]  R. Swanson,et al.  Identifying Personal Stories in Millions of Weblog Entries , 2009, ICWSM 2009.

[18]  Erez Lieberman Aiden,et al.  Quantitative Analysis of Culture Using Millions of Digitized Books , 2010, Science.

[19]  T. H. Kyaw,et al.  Multiparameter Intelligent Monitoring in Intensive Care II: A public-access intensive care unit database* , 2011, Critical care medicine.

[20]  Yoav Goldberg,et al.  A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books , 2013, *SEMEVAL.

[21]  Mathieu Lafourcade,et al.  Making people play for Lexical Acquisition with the JeuxDeMots prototype , 2007 .

[22]  Preslav Nakov,et al.  Solving Relational Similarity Problems Using the Web as a Corpus , 2008, ACL.

[23]  Ramanathan V. Guha,et al.  Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project , 1990 .

[24]  C SchankRoger,et al.  Dynamic Memory: A Theory of Reminding and Learning in Computers and People , 1983 .

[25]  M. Ashburner,et al.  The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration , 2007, Nature Biotechnology.

[26]  Reynold Xin,et al.  GraphX: a resilient distributed graph system on Spark , 2013, GRADES.

[27]  Alan Akbik,et al.  The Weltmodell: A Data-Driven Commonsense Knowledge Base , 2014, LREC.

[28]  Kent A. Spackman,et al.  SNOMED clinical terms: overview of the development process and project status , 2001, AMIA.

[29]  Erik T. Mueller,et al.  Open Mind Common Sense: Knowledge Acquisition from the General Public , 2002, OTM.

[30]  Christopher D. Manning,et al.  Leveraging Linguistic Structure For Open Domain Information Extraction , 2015, ACL.