Parallel Sentences Mining From The Web

Parallel sentences benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are retrieved by submitting sentence pairs to a search engine and are then validated using surface patterns. We propose an algorithm that extracts candidate bilingual resources and filters out useless bilingual web pages. The sentence pairs contained in the candidate bilingual web pages are verified by a maximum entropy classifier that combines length, word-overlap, alignment, and text location features. The training sets and the mining seeds are acquired automatically. Experiments show satisfactory parallel resource mining performance.
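
To make the verification step concrete, the following is a minimal sketch of a binary maximum entropy classifier (equivalent to logistic regression in the two-class case) over two of the feature types named above: length and word overlap. It is an illustration under stated assumptions, not the implementation used in this paper. The toy lexicon, the function names (`features`, `prob_parallel`, `train`), the exact feature definitions, and the training settings are all hypothetical, and the alignment and text location features are omitted for brevity.

```python
import math

# Hypothetical toy English-French lexicon, for illustration only.
LEXICON = {
    "the": {"le", "la"},
    "cat": {"chat"},
    "is": {"est"},
    "black": {"noir"},
    "house": {"maison"},
}

def features(src, tgt, lexicon=LEXICON):
    """Return [bias, length ratio, word overlap] for a candidate sentence pair."""
    s, t = src.lower().split(), tgt.lower().split()
    if not s or not t:
        return [1.0, 0.0, 0.0]
    # Length feature: ratio of the shorter to the longer sentence, in tokens.
    length_ratio = min(len(s), len(t)) / max(len(s), len(t))
    # Word-overlap feature: share of source tokens that have at least one
    # lexicon translation present in the target sentence.
    t_set = set(t)
    covered = sum(1 for w in s if lexicon.get(w, set()) & t_set)
    return [1.0, length_ratio, covered / len(s)]

def prob_parallel(weights, feats):
    """Probability that the pair is parallel under the current weights."""
    z = sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=500, lr=0.5):
    """Fit the weights by batch gradient ascent on the log-likelihood."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for feats, label in examples:
            err = label - prob_parallel(weights, feats)
            for i, f in enumerate(feats):
                grad[i] += err * f
        weights = [w + lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights
```

A classifier of this kind would be trained on labeled pairs (1 for parallel, 0 for non-parallel) and then applied to each candidate pair, keeping those whose `prob_parallel` score exceeds a chosen threshold such as 0.5.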