A Study of the Effect of Document Representations in Clustering-Based Cross-Document Coreference Resolution

Finding information about people in large text collections or online repositories on the Web is a common activity. We describe experiments that aim to identify the contribution of semantic information (e.g., named entities) and summarization (e.g., sentence extracts) to a cross-document coreference resolution system. Our system uses a clustering-based algorithm to group documents that refer to the same entity. The clustering operates on vector representations created by the summarization and semantic tagging components. We investigate different clustering configurations and show that the choice of summary type and of term type used in the vector representation is important for achieving good performance.
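As a rough illustration of the general idea of clustering documents by their vector representations, the sketch below builds term-frequency vectors and merges documents with single-link agglomerative clustering over cosine similarity. It is a minimal sketch under stated assumptions and not the system evaluated here: the tokenizer, the single-link criterion, the `threshold` value, and the function names `term_vector`, `cosine`, and `cluster` are all hypothetical choices made for the example. The input strings stand in for whatever representation is being compared, such as full documents or extracted summaries.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from lowercase word tokens.

    Assumption: plain word tokens; a real system might instead keep only
    named entities or other semantically tagged terms.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(texts, threshold=0.5):
    """Single-link agglomerative clustering (an assumed criterion).

    Repeatedly merges the two most similar clusters and stops when the
    best similarity falls below `threshold` (a hypothetical value).
    Returns a list of clusters, each a sorted list of document indices.
    """
    vectors = [term_vector(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(
                    cosine(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if sim > best:
                    best, pair = sim, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Four invented texts mentioning the same name, two per individual.
docs = [
    "John Smith plays guitar in a rock band",
    "The rock band of John Smith released a guitar album",
    "Professor John Smith teaches chemistry at the university",
    "John Smith published a chemistry paper at the university",
]
groups = cluster(docs, threshold=0.5)
print(groups)
```

Swapping what `term_vector` receives (a summary instead of the full text) or what it counts (entity mentions instead of all words) is the kind of configuration change the experiments compare.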
