Mining local gazetteers of literary Chinese with CRF and pattern based methods for biographical information in Chinese history

Person names and location names are essential building blocks for identifying events and social networks in historical documents written in literary Chinese. We take the lead in exploring the algorithmic recognition of named entities in literary Chinese for historical studies, using language-model-based and conditional-random-field-based methods, and extend our work to mining the document structures of historical documents. Practical evaluations were conducted with texts extracted from more than 220 volumes of local gazetteers (Difangzhi, 地方志). Difangzhi is a huge collection and the single most important source of information about officers who served in local governments in Chinese history. Our methods performed very well in these realistic tests: thousands of names and addresses were identified from the texts. A good portion of the extracted names match the biographical information currently recorded in the China Biographical Database (CBDB) of Harvard University, and many others can be verified by historians and will become new additions to CBDB.
