Sentence alignment is the task of determining which sentence or sentences in a source language correspond to which sentence or sentences in a target language. In this paper we present a new approach to aligning the sentences of a parallel corpus, based on a cross-language information retrieval system. The approach builds a database from the sentences of the target text and treats each sentence of the source text as a "query" to that database. The cross-language information retrieval system is a weighted Boolean search engine that relies on a deep linguistic analysis of both the query and the documents to be indexed. It is composed of a multilingual linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The multilingual linguistic analyzer includes a morphological analyzer, a part-of-speech tagger and a syntactic analyzer. It processes both the documents to be indexed and the queries, producing a set of normalized lemmas, a set of named entities and a set of nominal compounds, each with its morpho-syntactic tags. The statistical analyzer computes concept weights for the documents to be indexed, based on concept frequencies in the database. The reformulator expands each query at search time, before comparison, inferring from the original query words other words that express the same concepts. The comparator then computes the intersections between queries and documents and assigns a relevance weight to each intersection. The search engine retrieves the ranked, relevant documents from the indexes according to the corresponding reformulated query and then merges the results obtained for each language, taking into account the original words of the query and their weights in order to score the documents. The sentence aligner was evaluated on the MD corpus of the ARCADE II project, which is composed of news articles from the French newspaper "Le Monde Diplomatique". The part of the corpus used in the evaluation consists of the same subset of sentences in Arabic and French, with Arabic sentences aligned to their French counterparts. The results show that the aligner achieves satisfactory precision and recall even when the corpus is not completely parallel (changes in sentence order or missing sentences).
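The retrieval-based idea described above can be illustrated with a minimal sketch. It is not the system presented in the paper. It makes several simplifying assumptions: whitespace tokenization stands in for the linguistic analyzer, an IDF-style weight stands in for the statistical analyzer's concept weights, and a small hand-written bilingual lexicon stands in for the reformulator. All function names (`tokenize`, `build_index`, `reformulate`, `align`) are hypothetical, and the French-to-English toy data is chosen only for readability, not to reproduce the Arabic-French setting of the evaluation.

```python
import math
from collections import Counter

PUNCT = '.,;:!?"()'

def tokenize(sentence):
    # Assumption: a crude stand-in for the linguistic analyzer
    # (lowercase, split on whitespace, strip surrounding punctuation).
    tokens = (w.strip(PUNCT).lower() for w in sentence.split())
    return [t for t in tokens if t]

def build_index(target_sentences):
    # Index each target sentence as a set of terms and give each term
    # an IDF-style weight (assumption: rarer terms are more informative).
    docs = [set(tokenize(s)) for s in target_sentences]
    doc_freq = Counter(term for doc in docs for term in doc)
    n = len(docs)
    weights = {t: math.log(1 + n / c) for t, c in doc_freq.items()}
    return docs, weights

def reformulate(source_tokens, lexicon):
    # Expand source words into target-language candidates using a
    # hypothetical bilingual lexicon (dict: source word -> translations).
    expanded = set()
    for tok in source_tokens:
        expanded.update(lexicon.get(tok, []))
    return expanded

def align(source_sentences, target_sentences, lexicon):
    # Treat each source sentence as a query against the target index and
    # keep the best-scoring target sentence, or None if nothing matches.
    docs, weights = build_index(target_sentences)
    alignments = []
    for i, src in enumerate(source_sentences):
        query = reformulate(tokenize(src), lexicon)
        best_j, best_score = None, 0.0
        for j, doc in enumerate(docs):
            score = sum(weights[t] for t in query & doc)
            if score > best_score:
                best_j, best_score = j, score
        alignments.append((i, best_j))
    return alignments

# Toy data: the target sentences are in a different order, and the
# third source sentence has no counterpart in the target text.
source = ["Le chat dort.", "Il pleut aujourd'hui.", "Bonjour."]
target = ["It rains today.", "The cat sleeps."]
lexicon = {
    "le": ["the"], "chat": ["cat"], "dort": ["sleeps"],
    "il": ["it", "he"], "pleut": ["rains"], "aujourd'hui": ["today"],
}
print(align(source, target, lexicon))
```

Because each source sentence is matched independently of its position, a sketch of this kind tolerates reordered sentences, and a query with no overlapping terms is simply left unaligned.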