Towards population reconstruction: extraction of family relationships from historical documents

In this paper we present an approach for the automatic extraction of family relationships from a real-world collection of historical notary acts. We retrieve relationships such as husband-wife, parent-child, and widow of. We study two ways of dealing with this problem. In the first approach, we identify all person names in a document, generate all candidate pairs of names, and predict whether the two names in each pair are related, using classification techniques in which the text fragments that occur around and between the two names are used as features. In the second approach, we train and apply a Hidden Markov Model to annotate every word in a document with a tag indicating whether it is a name, a specified relationship descriptor, or neither. We then look for names that are connected to each other via relationship descriptors. We discuss challenges such as processing raw data, obtaining a sufficient number of training examples, and dealing with an imbalanced and noisy collection. We evaluate our results for each relationship type in terms of precision, recall, and F-score.
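The candidate-pair step of the first approach can be illustrated with a minimal Python sketch. It assumes that person names have already been located as token spans; the function name `candidate_pairs`, the `window` size, and the example sentence are hypothetical and are not taken from the paper or its notary collection. The sketch shows only how pairs and their surrounding text fragments might be assembled before being passed to a classifier.

```python
from itertools import combinations


def candidate_pairs(tokens, name_spans, window=3):
    """Build every pair of names in a document with its context fragments.

    tokens: list of word strings for one document.
    name_spans: list of (start, end) token indices, end exclusive.
    window: number of context tokens kept before the first name and
            after the second name (an assumed value, not from the paper).
    """
    pairs = []
    # combinations() over sorted spans yields each unordered pair once,
    # with the earlier name first.
    for (s1, e1), (s2, e2) in combinations(sorted(name_spans), 2):
        features = {
            "before": tokens[max(0, s1 - window):s1],
            "between": tokens[e1:s2],
            "after": tokens[e2:e2 + window],
        }
        first = " ".join(tokens[s1:e1])
        second = " ".join(tokens[s2:e2])
        pairs.append((first, second, features))
    return pairs


# Invented example sentence for illustration only.
tokens = "Jan Jansen husband of Maria Pieters sells a house".split()
name_spans = [(0, 2), (4, 6)]
pairs = candidate_pairs(tokens, name_spans)
```

Each resulting feature dictionary could then be turned into, for example, a bag-of-words vector for a binary or multi-class classifier. Because every pair of names in a document is generated, most pairs will be unrelated, which is one source of the class imbalance mentioned above.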
