Paper information - Automatic annotation of bibliographical references in digital humanities books, articles and blogs

In this paper, we address the problem of extracting and processing useful information from bibliographic references in Digital Humanities (DH) data. A machine learning technique for sequential data analysis, Conditional Random Fields (CRFs), is applied to a corpus extracted from the OpenEdition site, a web platform for journals and book collections in the humanities and social sciences. We present our ongoing project toward this goal, which as a preliminary step includes the construction of a suitable corpus and an efficient CRF model trained on it. This project is supported by a Google Grant for Digital Humanities. We conduct a number of experiments to find one of the best settings for a CRF model on the corpus, and we verify the results through both automatic and manual evaluation.

Patrice Bellot | Young-Min Kim | Elodie Faath | Marin Dacos
