Toward Algorithmic Discovery of Biographical Information in Local Gazetteers of Ancient China

Difangzhi (地方志) is a large collection of local gazetteers compiled by local governments in China, and these documents provide invaluable information about their localities. This paper reports the current status of our work on using natural language processing and text mining methods to identify biographical information about government officers, so that the information can be added to the China Biographical Database (CBDB), which is hosted by Harvard University. Information offered by CBDB is instrumental for human historians and serves as a core foundation for automatic tagging systems such as MARKUS of Leiden University. Mining texts in Difangzhi is not easy, partly because little is known so far about the grammar of literary Chinese. We employed language modeling and conditional random fields to find person and location names and the relationships between them. The methods were evaluated on realistic Difangzhi data of more than 2 million characters written in literary Chinese. Experimental results indicate that useful information was discovered from the current dataset.
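
To make the sequence-labeling framing concrete, the following is a minimal sketch of how unsegmented literary Chinese text might be prepared for a conditional random field tagger. It is not the authors' pipeline: the BIO tag scheme, the `PER`/`LOC` type names, the feature template, and the helper names (`bio_labels`, `char_features`, `decode_spans`) are all assumptions made for illustration, and the sample sentence is a made-up example rather than text from the Difangzhi dataset. No CRF library is called; the sketch only shows the per-character labels and features that such a model would typically consume, and how predicted labels map back to names.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation format: (start, end_exclusive, entity_type).
Span = Tuple[int, int, str]


def bio_labels(text: str, spans: List[Span]) -> List[str]:
    """Assign one B-/I-/O label to every character of `text`."""
    labels = ["O"] * len(text)
    for start, end, kind in spans:
        labels[start] = f"B-{kind}"
        for i in range(start + 1, end):
            labels[i] = f"I-{kind}"
    return labels


def char_features(text: str, i: int) -> Dict[str, str]:
    """Context-window features for the character at position i.

    The template (current, previous, next character and two bigrams)
    is an illustrative assumption, not the one used in the paper.
    """
    prev_ch = text[i - 1] if i > 0 else "<BOS>"
    next_ch = text[i + 1] if i + 1 < len(text) else "<EOS>"
    return {
        "char": text[i],
        "prev": prev_ch,
        "next": next_ch,
        "prev+char": prev_ch + text[i],
        "char+next": text[i] + next_ch,
    }


def decode_spans(text: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Turn a BIO label sequence back into (surface string, type) pairs."""
    entities: List[Tuple[str, str]] = []
    current, kind = "", None
    for ch, label in zip(text, labels):
        if label.startswith("B-"):
            # A new entity begins; close any entity still open.
            if current:
                entities.append((current, kind))
            current, kind = ch, label[2:]
        elif label.startswith("I-") and current and label[2:] == kind:
            current += ch
        else:
            # "O", or an I- tag that does not continue an open entity.
            if current:
                entities.append((current, kind))
            current, kind = "", None
    if current:
        entities.append((current, kind))
    return entities


# Illustrative sentence: "Wang Anshi, a native of Linchuan".
text = "王安石臨川人"
spans = [(0, 3, "PER"), (3, 5, "LOC")]
labels = bio_labels(text, spans)
features = [char_features(text, i) for i in range(len(text))]
print(labels)
print(decode_spans(text, labels))
```

Working at the character level avoids a separate word segmentation step, which is one plausible reason to prefer it for literary Chinese; in a full system the feature dictionaries and labels would be passed to a CRF trainer, and `decode_spans` would be applied to the predicted labels.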
