Knowledge Discovery with CRF-Based Clustering of Named Entities without a Priori Classes

Knowledge discovery aims to bring out coherent groups of entities. It usually relies on clustering, which requires defining a notion of similarity between the relevant entities. In this paper, we propose to repurpose a supervised machine learning technique, Conditional Random Fields (CRFs), which is widely used for supervised labeling tasks, to compute similarities among text sequences indirectly and without supervision. Our approach consists of generating artificial labeling problems on the data so that regularities between entities are revealed through the labels they receive. We describe how this framework can be implemented and evaluate it on two information extraction/discovery tasks. The results demonstrate the usefulness of this unsupervised approach and open many avenues for defining similarities over complex representations of textual data.
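The idea of deriving similarities from artificial labeling problems can be illustrated with a minimal sketch. It is not the paper's implementation: a toy feature-voting labeler stands in for the CRF (so sequence structure is ignored), and the function names, the half/half split, the number of fake labels, and the number of rounds are all assumptions made for illustration. In each round, random labels are assigned to one half of the entities, the labeler is trained on them and applied to the other half, and pairs that receive the same label accumulate similarity.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def train_toy_labeler(examples):
    """Count, for every feature, how often it co-occurs with each label.

    Hypothetical stand-in for CRF training; examples is a list of
    (feature_set, label) pairs.
    """
    counts = defaultdict(Counter)
    for features, label in examples:
        for feature in features:
            counts[feature][label] += 1
    return counts


def predict(counts, features):
    """Return the label with the most feature votes, or None if no feature is known."""
    votes = Counter()
    for feature in features:
        votes.update(counts.get(feature, {}))
    if not votes:
        return None
    best = max(votes.values())
    # Deterministic tie-break: smallest label among the top-voted ones.
    return min(label for label, v in votes.items() if v == best)


def colabeling_similarity(items, n_rounds=200, n_labels=3, seed=0):
    """Estimate pairwise similarity as the fraction of rounds in which
    two entities receive the same artificial label.

    items maps an entity name to its set of features (assumed representation).
    """
    rng = random.Random(seed)
    names = sorted(items)
    same = Counter()   # rounds where both entities got the same label
    seen = Counter()   # rounds where both entities were labeled at all
    for _ in range(n_rounds):
        shuffled = names[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, apply_set = shuffled[:half], shuffled[half:]
        # Artificial problem: the training labels are purely random.
        model = train_toy_labeler(
            [(items[name], rng.randrange(n_labels)) for name in train]
        )
        predicted = {name: predict(model, items[name]) for name in apply_set}
        for a, b in combinations(sorted(apply_set), 2):
            if predicted[a] is None or predicted[b] is None:
                continue
            seen[(a, b)] += 1
            if predicted[a] == predicted[b]:
                same[(a, b)] += 1
    return {pair: same[pair] / seen[pair] for pair in seen}


if __name__ == "__main__":
    toy = {
        "paris": {"flew_to", "capital_of"},
        "london": {"flew_to", "capital_of"},
        "berlin": {"flew_to", "mayor_of"},
        "smith": {"said", "mr"},
        "jones": {"said", "mr"},
        "brown": {"said", "dr"},
    }
    for pair, score in sorted(colabeling_similarity(toy).items()):
        print(pair, round(score, 2))
```

The resulting dictionary can be read as a sparse similarity matrix and handed to any clustering algorithm that works on pairwise similarities. Entities sharing many features tend to be labeled alike regardless of which random labels were drawn, which is what makes the accumulated co-labeling rate usable as a similarity.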
