Paper information - Automatic annotation of incomplete and scattered bibliographical references in Digital Humanities papers

In this paper, we deal with the problem of extracting and processing useful information from bibliographic references in Digital Humanities (DH) data. We present our ongoing project BILBO, supported by a Google Grant for Digital Humanities, which includes the constitution of proper reference corpora and the construction of an efficient annotation model using several appropriate machine learning techniques. Conditional Random Fields (CRFs) are used as the basic approach to automatic annotation of reference fields, and a Support Vector Machine (SVM) with a set of newly proposed features is applied for sequence classification. A number of experiments are conducted to find one of the best feature settings for the CRF model on these corpora.

RÉSUMÉ. Extracting bibliographic information from unstructured text remains an open problem, which we address with machine learning approaches in the field of Digital Humanities. In this article we present the BILBO project, supported by a Google Digital Humanities Award with the support of the ANR CAAS project: the constitution of three reference corpora corresponding to three locations of references, the development of an annotation model, and then its evaluation. Conditional random fields (CRFs) are used to annotate bibliographic references, and support vector machines (SVMs) to identify references within the text. Numerous experiments are conducted to determine the best features for the models to exploit.

Patrice Bellot | Young-Min Kim | Elodie Faath | Marin Dacos
