Parallel Community Detection for Cross-Document Coreference

This paper presents a highly parallel solution for cross-document coreference resolution that can handle the billions of documents on the current web. At the core of our solution lies a novel algorithm for community detection in large-scale graphs. We construct these graphs by representing documents' keywords as nodes and the co-location of those keywords in a document as edges. We then exploit a particular property of such graphs: coreferent words are topologically clustered and can be discovered efficiently by our community detection algorithm. The accuracy of our technique is considerably higher than that of the state of the art, while the convergence time is far shorter. In particular, we increase the accuracy on a baseline dataset by more than 15% compared to the best result reported so far. Moreover, we outperform the best reported result on a dataset provided for the Word Sense Induction task in SemEval 2010.
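
The graph construction described above can be illustrated with a minimal sketch. It covers only the step of turning keywords into nodes and co-location into edges, not the community detection algorithm itself. The function name `build_cooccurrence_graph`, the toy documents, and the choice to weight each edge by the number of documents in which the pair co-occurs are all assumptions made for illustration; the abstract does not specify edge weights or keyword extraction.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Build an undirected keyword graph from an iterable of keyword lists.

    Nodes are keywords; an edge joins two keywords that appear in the same
    document. Returns {(u, v): weight} with u < v, where the weight is the
    number of documents containing both keywords (an assumed weighting).
    """
    edges = defaultdict(int)
    for keywords in documents:
        # De-duplicate and sort so each unordered pair has one canonical key.
        for u, v in combinations(sorted(set(keywords)), 2):
            edges[(u, v)] += 1
    return dict(edges)


# Hypothetical toy input: each document is already reduced to its keywords.
docs = [
    ["clinton", "president", "white_house"],
    ["clinton", "president"],
    ["clinton", "senator", "new_york"],
]
graph = build_cooccurrence_graph(docs)
```

In this toy input, the keywords around "president" and those around "senator" form two densely connected groups that share only the ambiguous node "clinton", which is the kind of topological clustering a community detection step would then try to separate.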
