From Parallel to Comparable Text Corpora

We present a bilingual corpus management system under development in Pisa. The first component of this system was a set of procedures to create and query parallel text archives; we are now studying the implementation of a second set of procedures to interrogate comparable archives. The approach followed is quite different from that used for parallel data and considerably more complex; the results are also very different. In the paper, we describe the strategy we are adopting to retrieve significant data from comparable corpora, and discuss the preliminary results.