This paper introduces an ongoing research project on entity identification in open (publicly available) source documents in which some of the identifying attributes have been redacted. The proof of concept focuses on published obituary notices as the target source, with the decedent and the other individuals listed in the notice as the identities to be resolved. The paper describes an identification process that uses identity reference sources to create lists of candidate identities based on the partial set of identity attributes found in the published text. Identity resolution is accomplished by finding those cases where two different candidate identities share the same complete set of identity attributes. Preliminary work using notices where the identity can be confirmed has produced promising initial results. The paper also describes some of the challenges encountered in automating the process of extracting identity attribute features from open text and in scaling the resolution process to large numbers of notices.
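The candidate-list approach described above can be illustrated with a minimal sketch. It is one possible reading of the process, not the project's implementation: the `Identity` fields, the two reference sources, and the helper names `candidates` and `resolve` are all hypothetical, and the records are invented for illustration. Each reference source is queried with the partial attributes that survive in the notice, and an identity is treated as resolved when the same complete attribute set appears in both candidate lists.

```python
from typing import NamedTuple


class Identity(NamedTuple):
    # Hypothetical complete attribute set for one person.
    first: str
    last: str
    birth_year: int
    city: str


def candidates(reference, partial):
    """Return every reference identity consistent with the partial attributes."""
    return [r for r in reference
            if all(getattr(r, key) == value for key, value in partial.items())]


def resolve(list_a, list_b):
    """Keep identities whose complete attribute sets agree in both lists."""
    return sorted(set(list_a) & set(list_b))


# Two hypothetical identity reference sources (invented records).
SOURCE_A = [
    Identity("John", "Smith", 1940, "Little Rock"),
    Identity("John", "Smith", 1952, "Conway"),
    Identity("Mary", "Smith", 1940, "Little Rock"),
]
SOURCE_B = [
    Identity("John", "Smith", 1940, "Little Rock"),
    Identity("John", "Jones", 1940, "Little Rock"),
    Identity("John", "Smith", 1961, "Little Rock"),
]

# Partial attributes assumed to remain in the notice; birth year is redacted.
list_a = candidates(SOURCE_A, {"first": "John", "last": "Smith"})
list_b = candidates(SOURCE_B, {"first": "John", "city": "Little Rock"})

resolved = resolve(list_a, list_b)
print(resolved)
```

In this toy example neither query is selective on its own, but only one identity has the same complete attribute set in both candidate lists. Real reference sources would need fuzzy matching on names and dates, which this sketch leaves out.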