Constructing a Focused Taxonomy from a Document Collection

We describe a new method for constructing custom taxonomies from document collections. It involves identifying relevant concepts and entities in text; linking them to knowledge sources like Wikipedia, DBpedia, Freebase, and any supplied taxonomies from related domains; disambiguating conflicting concept mappings; and selecting semantic relations that best group them hierarchically. An RDF model supports interoperability of these steps, and also provides a flexible way of including existing NLP tools and further knowledge sources. From 2000 news articles we construct a custom taxonomy with 10,000 concepts and 12,700 relations, similar in structure to manually created counterparts. Evaluation by 15 human judges shows the precision to be 89% and 90% for concepts and relations respectively; recall was 75% with respect to a manually generated taxonomy for the same domain.

[1]  Ian H. Witten,et al.  An open-source toolkit for mining Wikipedia , 2013, Artif. Intell..

[2]  Mónica Marrero,et al.  Evaluation of Named Entity Extraction Systems , 2009 .

[3]  Haixun Wang,et al.  Probase: a probabilistic taxonomy for text understanding , 2012, SIGMOD Conference.

[4]  W. Bruce Croft,et al.  Deriving concept hierarchies from text , 1999, SIGIR '99.

[5]  Christian Bizer,et al.  DBpedia spotlight: shedding light on the web of documents , 2011, I-Semantics '11.

[6]  Michael J. Witbrock,et al.  Searching for Common Sense: Populating Cyc™ from the Web , 2005, AAAI.

[7]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[8]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[9]  Lora Aroyo,et al.  The Semantic Web: Research and Applications , 2009, Lecture Notes in Computer Science.

[10]  James A. Hendler,et al.  The Semantic Web — ISWC 2002 , 2002, Lecture Notes in Computer Science.

[11]  Ian H. Witten,et al.  Learning to link with wikipedia , 2008, CIKM '08.

[12]  Marti A. Hearst,et al.  Automating Creation of Hierarchical Faceted Metadata Structures , 2007, NAACL.

[13]  Isabelle Augenstein,et al.  LODifier: Generating Linked Data from Unstructured Text , 2012, ESWC.

[14]  Simone Paolo Ponzetto,et al.  Deriving a Large-Scale Taxonomy from Wikipedia , 2007, AAAI.

[15]  Daniel Jurafsky,et al.  Semantic Taxonomy Induction from Heterogenous Evidence , 2006, ACL.

[16]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[17]  Sharon A. Caraballo Automatic construction of a hypernym-labeled noun hierarchy from text , 1999, ACL.

[18]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[19]  Panagiotis G. Ipeirotis,et al.  Automatic Extraction of Useful Facet Hierarchies from Text Databases , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[20]  Frank van Harmelen,et al.  Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema , 2002, SEMWEB.