Collecting Conceptualized Relations from Terabytes of Web Texts for Understanding Unknown Terms

This paper describes our attempt to extract a wide variety of relations between superordinate concepts from a terabyte-scale Web corpus, with the aim of speculating on the meaning of unknown terms in a human-like way. To discover diverse conceptualized relations, we focus on Web-scale text corpora and introduce a simple string-matching method to process them. To derive relations between concepts, our method first extracts relations between terms and then replaces each term with appropriate concepts using knowledge from Wikipedia, WordNet, and YAGO. Using 100 machines, we extracted over 10 million relations between concepts in a day from more than 10 TB of Web texts. Experimental results revealed that the relations extracted by our method contained many more meaningless relations than those extracted by NLP-based methods. Nevertheless, they were useful in an application that speculates on the meaning of unknown terms, improving recall by more than 0.06 points while decreasing accuracy by only 0.04 points (the F1-measure improved by 0.03 points). From these results we found that the coverage of conceptualized relations is important for improving precision in the application. This is because a lack of knowledge (conceptualized relations) leads to misunderstanding the meaning of unknown terms, just as humans misunderstand things when their knowledge is insufficient.
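The two-step procedure summarized above (extract relations between terms by string matching, then replace each term with its concepts) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dictionary `TERM_CONCEPTS`, the function names, and the rule of taking the text between two adjacent known terms as the relation phrase are all assumptions made for illustration. In the paper, the term-to-concept mapping is drawn from Wikipedia, WordNet, and YAGO.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy dictionary mapping terms to superordinate concepts.
# (Assumption: stands in for knowledge from Wikipedia, WordNet, and YAGO.)
TERM_CONCEPTS = {
    "aspirin": ["drug"],
    "ibuprofen": ["drug"],
    "headache": ["symptom"],
    "fever": ["symptom"],
}


def extract_term_relations(sentence, terms):
    """Find known terms by plain string matching and return
    (term1, connecting phrase, term2) for each pair of adjacent matches."""
    lowered = sentence.lower()
    hits = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append((match.start(), match.end(), term))
    hits.sort()  # order matches by position in the sentence

    relations = []
    for (_, end1, term1), (start2, _, term2) in zip(hits, hits[1:]):
        phrase = lowered[end1:start2].strip()
        if phrase:  # skip overlapping or directly adjacent matches
            relations.append((term1, phrase, term2))
    return relations


def conceptualize(relations, term_concepts):
    """Replace each term with its concepts and count the resulting
    (concept1, phrase, concept2) triples."""
    counts = Counter()
    for term1, phrase, term2 in relations:
        for c1, c2 in product(term_concepts[term1], term_concepts[term2]):
            counts[(c1, phrase, c2)] += 1
    return counts


sentences = ["Aspirin relieves headache.", "Ibuprofen relieves fever."]
term_relations = []
for s in sentences:
    term_relations.extend(extract_term_relations(s, TERM_CONCEPTS))

concept_relations = conceptualize(term_relations, TERM_CONCEPTS)
print(concept_relations)
```

In this sketch, two different term-level relations collapse into a single conceptualized relation between "drug" and "symptom", which is the kind of generalization that would let an unknown drug name be interpreted from its context.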
