This paper presents a methodology for building a large German-French parallel corpus consisting of official documents of the European Union and Switzerland, and of private and public organisations in France and Germany. The texts are morphosyntactically annotated, aligned at the sentence level, and marked up in conformance with the TEI guidelines for standardised representation. A multilevel alignment method is applied; its precision is improved by correlating its results with the constraints of the classical alignment method of Gale and Church. The alignment information is encoded externally to the parallel text documents. The corpus is created through a multi-step procedure that applies a number of software tools and adjusts the intermediate results they produce.
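For readers unfamiliar with the classical method of Gale and Church referred to above, the following is a minimal sketch of length-based sentence alignment. It is not the multilevel method used for this corpus, nor the authors' implementation. The function names (`match_cost`, `align`) are hypothetical, the parameter values are those reported by Gale and Church for their own data (with the insertion and deletion prior applied to each direction, as reimplementations commonly do), and the variance is computed from the mean of the two lengths so that empty sides can be handled. All of these would need re-estimating for German-French text.

```python
import math

# Assumed defaults taken from Gale and Church (1993); not tuned for this corpus.
C = 1.0    # expected number of target characters per source character
S2 = 6.8   # variance of the length difference per character
# Prior probability of each bead type (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def norm_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def match_cost(len_src, len_tgt, bead):
    """Negative log probability of pairing len_src with len_tgt characters."""
    if len_src == 0 and len_tgt == 0:
        return -math.log(PRIORS[bead])
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a length mismatch at least this large.
    p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-300)
    return -math.log(p_delta) - math.log(PRIORS[bead])


def align(src_lens, tgt_lens):
    """Return the lowest-cost beads as (source indices, target indices) pairs."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in PRIORS:
                if i + di > n or j + dj > m:
                    continue
                step = match_cost(sum(src_lens[i:i + di]),
                                  sum(tgt_lens[j:j + dj]), (di, dj))
                if cost[i][j] + step < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cost[i][j] + step
                    back[i + di][j + dj] = (di, dj)
    # Trace the best path back from the end of both texts.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


# Sentence lengths in characters; the values are made up for illustration.
print(align([20, 35, 50], [22, 33, 52]))
print(align([60], [28, 31]))
```

The first call pairs the sentences one to one, while the second groups two short target sentences with a single long source sentence. The beads are returned as index pairs rather than written into the texts, which loosely mirrors the idea of keeping alignment information external to the aligned documents.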
[1] C. M. Sperberg-McQueen, et al. Guidelines for electronic text encoding and interchange, 1994.
[2] Jean Véronis, et al. Text Encoding Initiative, 1995, Springer Netherlands.
[3] Kenneth Ward Church, et al. A Program for Aligning Sentences in Bilingual Corpora, 1993, Computational Linguistics.
[4] C. M. Sperberg-McQueen. The Text Encoding Initiative, 1994.
[5] Jean Véronis, et al. Parallel Text Processing, 2000.
[6] Laurent Romary, et al. Parallel alignment of structured documents, 2000.