This paper presents work in progress on the automatic extraction of quotation sentences from newspaper articles. The focus is on the extraction and annotation of unmarked quotation sentences. A linguistic study shows that unmarked quotation sentences can be formalised into 16 patterns, which can be used to develop an extraction grammar. The question of identifying the boundaries of unmarked quotations is also raised, since these boundaries are often ambiguous. An annotation scheme is defined that describes all the elements that can occur in a quotation sentence. The paper also presents the creation of two resources required by our system. First, a dictionary of verbs introducing quotations has been built automatically, using a grammar of marked quotation sentences to identify the verbs able to introduce quotations. Second, a grammar formalising the patterns of unmarked quotation sentences has been developed with Unitex, a tool based on finite-state machines. A short experiment performed on two patterns shows promising results.
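The idea of harvesting quotation-introducing verbs from marked quotations can be illustrated with a toy sketch. It is not the Unitex grammar described above and makes strong simplifying assumptions: quotations are delimited by French guillemets, the reporting clause follows the closing guillemet and a comma, and the verb is a single simple-tense word (a compound form such as "a déclaré" would yield only the auxiliary). The pattern `MARKED_QUOTE`, the function `candidate_verbs`, and the sample sentences are hypothetical and serve only as illustration.

```python
import re
from collections import Counter

# Hypothetical pattern for a marked quotation: text in guillemets,
# then a comma, then the first word of the reporting clause.
MARKED_QUOTE = re.compile(r"«[^»]+»\s*,\s*(\w+)")


def candidate_verbs(sentences):
    """Count the word that directly follows each marked quotation."""
    counts = Counter()
    for sentence in sentences:
        for match in MARKED_QUOTE.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts


# Invented sample sentences, not taken from the paper's corpus.
sample = [
    "« Nous irons jusqu'au bout », affirme le ministre.",
    "« C'est une erreur », estime la députée.",
    "« Rien n'est décidé », affirme le maire.",
    "Le maire a quitté la salle.",
]

verbs = candidate_verbs(sample)
for verb, count in verbs.most_common():
    print(verb, count)
```

Words collected this way are only candidates. A real dictionary would still need lemmatisation and filtering, and once built it could help recognise reporting clauses in sentences where no quotation marks are present.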