The LODIE team (University of Sheffield) Participation at the TAC2015 Entity Discovery Task of the Cold Start KBP Track

This paper describes the participation of the LODIE team (from the OAK lab of the University of Sheffield) in the Entity Discovery task of the Cold Start KBP track at TAC-KBP 2015. We take a cross-document coreference resolution approach that starts with Named Entity Recognition to locate and classify mentions of named entities, followed by a clustering procedure that groups mentions referring to the same entity. Our primary interest was to study different features and their effect on the clustering process, as well as scalable methods for coping with very large data. We experimented with several feature combinations and conclude that the best results are obtained using features based on entity surface forms and distributed word embeddings. To cope with large-scale data, the clustering process takes a two-step approach that breaks the data into smaller batches. On the 2015 evaluation dataset, our method obtains a best CEAF mention F-measure of 63.21.
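As an illustration only, the two-step idea can be sketched in Python. The abstract does not specify the batching criterion, the clustering algorithm, the similarity weighting, or the threshold, so everything below is an assumption: the `batch_key` rule (entity type plus the first letter of the last token), the greedy single-pass clustering, the equal weighting `alpha`, and the toy two-dimensional vectors standing in for word embeddings are all hypothetical choices, not the system described in the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity(m1, m2, alpha=0.5):
    """Blend surface-form similarity with embedding similarity.

    alpha is a hypothetical weight; the paper does not state one.
    """
    surface = SequenceMatcher(None, m1["text"].lower(), m2["text"].lower()).ratio()
    return alpha * surface + (1 - alpha) * cosine(m1["vec"], m2["vec"])


def batch_key(mention):
    """Step 1 (assumed rule): entity type plus first letter of the last token."""
    return (mention["type"], mention["text"].split()[-1][0].lower())


def cluster_batch(mentions, threshold):
    """Step 2 (assumed): greedy single-pass clustering inside one batch.

    A mention joins the first cluster containing a sufficiently similar
    member; otherwise it starts a new cluster.
    """
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if any(similarity(mention, other) >= threshold for other in cluster):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


def two_step_cluster(mentions, threshold=0.75):
    """Split mentions into batches, then cluster each batch independently."""
    batches = defaultdict(list)
    for mention in mentions:
        batches[batch_key(mention)].append(mention)
    clusters = []
    for batch in batches.values():
        clusters.extend(cluster_batch(batch, threshold))
    return clusters


if __name__ == "__main__":
    # Toy mentions; the vectors are made-up stand-ins for context embeddings.
    mentions = [
        {"text": "Barack Obama", "type": "PER", "vec": [1.0, 0.0]},
        {"text": "Obama", "type": "PER", "vec": [0.9, 0.1]},
        {"text": "Michelle Obama", "type": "PER", "vec": [0.0, 1.0]},
        {"text": "Sheffield", "type": "GPE", "vec": [0.5, 0.5]},
    ]
    for cluster in two_step_cluster(mentions):
        print([m["text"] for m in cluster])
```

Batching keeps the pairwise comparisons confined to each batch, so the quadratic cost applies to batch sizes rather than to the whole collection; the trade-off is that two mentions placed in different batches can never be merged.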
