Probase: a probabilistic taxonomy for text understanding

Knowledge is indispensable to understanding. The ongoing information explosion highlights the need to enable machines to better understand electronic text written in human language. Much work has been devoted to creating universal ontologies or taxonomies for this purpose. However, none of the existing ontologies has the depth and breadth needed for universal understanding. In this paper, we present a universal, probabilistic taxonomy that is more comprehensive than any existing one. It contains 2.7 million concepts harnessed automatically from a corpus of 1.68 billion web pages. Unlike traditional taxonomies that treat knowledge as black and white, it uses probabilities to model the inconsistent, ambiguous, and uncertain information it contains. We describe how the taxonomy is constructed, how it is modeled probabilistically, and how it can be applied to text understanding.
