A Methodology for Cross-document Coreference

Amit Bagga
General Electric CRD, PO Box 8
Schenectady, NY 12301
bagga@crd.ge.com

Alan W. Biermann
Dept. of Computer Science
Duke University
Durham, NC 27708
awb@cs.duke.edu

Cross-document coreference occurs when the same person, place, event, or concept is discussed in more than one text source. Computer recognition of this phenomenon is important because it helps break "the document boundary" by allowing a user to examine information about a particular entity or event from multiple text sources at the same time. Resolving cross-document coreference has been considered a difficult problem that requires the output of an information extraction system (Grishman 1994). However, recent research has shown that cross-document coreferences for both entities and events can be resolved accurately (Bagga & Baldwin 1998b), (Bagga & Baldwin 1999).

Cross-Document Coreference: The Problem

Cross-document coreference is a distinct technology from Named Entity recognizers like IsoQuest's NetOwl and IBM's Textract because it attempts to determine whether name matches actually refer to the same individual (not all John Smiths are the same). Neither NetOwl nor Textract has a mechanism that tries to keep same-named individuals distinct if they are different people.

Cross-document coreference also differs in substantial ways from within-document coreference. Within a document there is a certain amount of consistency which cannot be expected across documents. In addition, the problems encountered during within-document coreference are compounded when looking for coreferences across documents because the underlying principles of linguistics and discourse context no longer apply across documents. Because the underlying assumptions in cross-document coreference are so distinct, they require novel approaches.

Architecture and the Methodology

In this section we describe a cross-document coreference resolution system that resolves coreferences for both entities and events using the Vector Space Model.
Figure 1 shows the architecture of the system, which is built upon the University of Pennsylvania's within-document coreference system, CAMP (Baldwin & others 1995) (Baldwin & others 1998).

John Perry, of Weston Golf Club, announced his resignation yesterday. He was the President of the Massachusetts Golf Association. During his two years in office, Perry guided the MGA into a closer relationship with the Women's Golf Association of Massachusetts.

Figure 2: Extract from doc.36
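
As a rough illustration of the kind of comparison a Vector Space Model makes possible, the sketch below turns text extracts into tf-idf weighted term vectors and compares them by cosine similarity. It is a minimal sketch under stated assumptions, not the system described here: the tokenizer, the smoothed idf weighting, the 0.2 threshold, and the function names (tokenize, tf_idf_vectors, cosine, same_entity) are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplifying assumption)."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tf_idf_vectors(docs):
    """Build one sparse tf-idf vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf (an assumed weighting) stays positive for every term.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def same_entity(u, v, threshold=0.2):
    """Treat two extracts as coreferent when similarity reaches the (assumed) threshold."""
    return cosine(u, v) >= threshold
```

For instance, calling tf_idf_vectors on the doc.36 extract together with other extracts that mention a John Perry yields one vector per extract, and same_entity can then be applied pairwise. The threshold would need tuning on real data; the value used here is only a placeholder.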