Towards Automatic Retrieval of Idioms in French Newspaper Corpora

This paper presents a procedure for the automatic retrieval of idiomatic expressions from large text corpora. The procedure combines text segmentation techniques with latent semantic analysis. Three indices were computed on the basis of a three-fold hypothesis: (1) idiomatic expressions should have few neighbours; (2) the words composing an idiomatic expression should show low semantic proximity to one another; (3) an idiomatic expression should show low semantic proximity to the preceding and subsequent segments. The results show that fully automatic retrieval of idioms from large corpora has not yet been achieved, but this first trial indicates that the approach is promising. The procedure reduces the amount of data to consider to less than a quarter (23.8 per cent) of the original data, of which about one-fifth (20.9 per cent) is idiomatic and nearly 60 per cent (58.8 per cent) is phraseological in nature. The procedure thus considerably facilitates manual retrieval. In addition, these first results already permit some linguistic exploitation of the retrieved idioms.
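
As an illustration only, the second and third hypotheses can be read as cosine-based proximity scores in a semantic space. The sketch below is not the implementation used in this study: the function names, the toy three-dimensional vectors, the choice of mean pairwise cosine for internal proximity, and the summing of word vectors to represent an expression or segment are all assumptions made for the example. In practice the vectors would come from a latent semantic analysis of the corpus.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def add_vectors(vectors):
    """Component-wise sum, used here (by assumption) as the vector of a word group."""
    return [sum(components) for components in zip(*vectors)]


def internal_proximity(word_vectors):
    """Hypothesis 2 (assumed form): mean pairwise cosine between the words of an expression."""
    pairs = list(combinations(word_vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def context_proximity(word_vectors, neighbour_segments):
    """Hypothesis 3 (assumed form): mean cosine between the expression and adjacent segments."""
    if not neighbour_segments:
        return 0.0
    expression = add_vectors(word_vectors)
    sims = [cosine(expression, add_vectors(seg)) for seg in neighbour_segments]
    return sum(sims) / len(sims)


# Hypothetical toy vectors; real ones would be rows of a reduced LSA matrix.
toy_space = {
    "casser": [0.9, 0.1, 0.0],
    "pipe": [0.0, 0.2, 0.9],
    "mourir": [0.1, 0.9, 0.1],
    "hier": [0.2, 0.8, 0.2],
}

expression = [toy_space["casser"], toy_space["pipe"]]
context = [[toy_space["mourir"], toy_space["hier"]]]

print("internal proximity:", round(internal_proximity(expression), 3))
print("context proximity:", round(context_proximity(expression, context), 3))
```

Under this reading, a candidate whose scores fall below some threshold on both measures would be kept for manual inspection; the thresholds themselves would have to be set empirically.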