Web-based probabilistic and possibilistic language models

Language models are usually built either from a closed corpus or from documents retrieved from the World Wide Web, which are then treated as a closed corpus themselves. In this paper we propose several other ways of using this resource for language modeling. We first improve an approach that estimates n-gram probabilities from Web search engine statistics. We then propose a new way of handling the information extracted from the Web within a probabilistic framework. Finally, we propose to rely on Possibility Theory to exploit this kind of information effectively. We compare these two approaches on two automatic speech recognition tasks: (i) transcribing broadcast news data, and (ii) transcribing domain-specific data, namely comments on films of surgical operations. We show that the two approaches are effective in different situations.
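To make the contrast concrete, here is a minimal sketch (not the authors' implementation) of the two kinds of scores the abstract contrasts: a probabilistic n-gram estimate built from search engine hit counts, and a possibilistic degree obtained by max-normalization, a common way of turning counts into possibility degrees. All hit counts and words below are hypothetical stand-ins for what a search engine API would return.

```python
# Illustrative sketch: scoring continuations of an n-gram history
# from (hypothetical) Web search engine hit counts.

def probabilistic_score(hits_ngram, hits_history):
    """Classic Web-count estimate: P(w | h) ~ hits(h w) / hits(h)."""
    if hits_history == 0:
        return 0.0
    return hits_ngram / hits_history

def possibilistic_score(hits_ngram, max_hits):
    """A possibility degree in [0, 1]: normalize by the largest count,
    so the most frequent continuation gets possibility 1."""
    if max_hits == 0:
        return 0.0
    return hits_ngram / max_hits

# Hypothetical hit counts for continuations of the history "surgical":
counts = {"operation": 120_000, "team": 45_000, "robot": 5_000}
hits_history = sum(counts.values())   # proxy for hits of the history alone
max_hits = max(counts.values())

for word, hits in counts.items():
    p = probabilistic_score(hits, hits_history)
    pi = possibilistic_score(hits, max_hits)
    print(f"{word}: P={p:.3f}  Pi={pi:.3f}")
```

Note the key difference: the probabilistic scores sum to one over the candidate continuations, while the possibilistic scores only require that the best candidate reach degree one, which is less sensitive to the unreliable absolute magnitudes of Web counts.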