Filtering and clustering relations for unsupervised information extraction in open domain

Information Extraction has recently been extended to new areas by loosening the constraints on the strict definition of the extracted information, which makes it possible to design more open information extraction systems. Within this new domain of unsupervised information extraction, we focus on the task of extracting and characterizing a priori unknown relations between a given set of entity types. One of the challenges of this task is dealing with the large number of candidate relations extracted from a large corpus. In this paper, we propose an approach for filtering such candidate relations based on heuristics and machine learning models. More precisely, evaluations performed on a manually annotated corpus of about one thousand relations show that the best model for this task is a Conditional Random Field model. We also tackle the problem of identifying semantically similar relations by clustering large sets of them. This clustering is achieved by combining a classical clustering algorithm with a method for the efficient identification of highly similar relation pairs. Finally, we evaluate the impact of our relation filtering on this semantic clustering with both internal and external measures. Results show that the filtering procedure doubles the recall of the clustering while keeping the same precision.
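
The clustering step described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, since the abstract names neither the similarity measure nor the algorithms: relations are represented as bags of words, similarity is cosine, highly similar pairs are found with a simple inverted index, and connected components stand in for the "classical clustering algorithm". The function names (normalize, similar_pairs, connected_components) and the threshold value are hypothetical.

```python
from collections import Counter, defaultdict
import math


def normalize(tokens):
    """Turn a token list into an L2-normalized sparse vector (a dict)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # empty relation text: no terms, no similarity
    return {t: c / norm for t, c in counts.items()}


def similar_pairs(vectors, threshold):
    """Return {(i, j): cosine} for all i < j with cosine >= threshold.

    An inverted index limits comparisons to vectors that share at least
    one term, instead of scoring every possible pair.
    """
    index = defaultdict(list)  # term -> [(vector id, weight)]
    result = {}
    for i, vec in enumerate(vectors):
        scores = defaultdict(float)
        # Accumulate dot products with previously indexed vectors.
        for term, w in vec.items():
            for j, wj in index[term]:
                scores[j] += w * wj
        for j, s in scores.items():
            if s >= threshold:
                result[(j, i)] = s
        # Index the current vector only after scoring it.
        for term, w in vec.items():
            index[term].append((i, w))
    return result


def connected_components(n, pairs):
    """Cluster n items by linking every similar pair (union-find).

    Assumption: a stand-in for the clustering algorithm, not the
    method used in the paper.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for x in range(n):
        groups[find(x)].append(x)
    return sorted(groups.values())


# Hypothetical relation expressions, already tokenized.
relations = [
    ["founded", "by"],
    ["was", "founded", "by"],
    ["acquired"],
    ["has", "acquired"],
]
vectors = [normalize(r) for r in relations]
pairs = similar_pairs(vectors, threshold=0.7)
clusters = connected_components(len(relations), pairs)
```

In this toy input, the two "founded by" variants and the two "acquired" variants each end up in their own cluster. The inverted index is the part that matters at scale: pairs of relations with no term in common are never compared.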
