International business intelligence processing is an important cross-disciplinary research problem in artificial intelligence. Recognizing out-of-vocabulary (OOV) words in international commercial activities, and the OOV phrases derived from them, is a challenge for widely used machine translation technology, because an electronic dictionary with a fixed lexicon cannot keep pace with the rapid growth of OOV phrases in international commerce. In this paper, we present a technique for recognizing and translating OOV phrases in international business intelligence, based on a sentence-aligned web corpus. We first collect the latest and most relevant text from the Internet and build a sentence-aligned corpus. We then use a Markov model to compute the relevancy of adjacent word strings, obtain the maximum-likelihood segmentation, and identify the OOV words and OOV phrases in the business context. Next, we remove redundant candidates and compute the probabilities and weights of co-occurring word pairs. This yields the OOV word pairs and, from them, the translations of OOV phrases in business intelligence. Experiments indicate that the method performs well in the international business domain and can be updated in a timely manner.
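The co-occurrence step can be illustrated with a minimal sketch. The abstract does not state which weighting the authors use, so the Dice coefficient below is an assumption chosen for simplicity, and the function names (`cooccurrence_scores`, `best_translation`) and the toy corpus are hypothetical. The sketch assumes both sides of each aligned sentence pair have already been segmented into tokens.

```python
from collections import Counter
from itertools import product


def cooccurrence_scores(aligned_pairs):
    """Score (source_word, target_word) pairs over a sentence-aligned corpus.

    Assumption: the weight is the Dice coefficient,
    2 * count(s, t) / (count(s) + count(t)), counted once per sentence pair.
    The paper's actual weighting is not given in the abstract.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned_pairs:
        # Use sets so a word is counted at most once per sentence.
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_translation(scores, source_word):
    """Return the highest-scoring target word for source_word, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == source_word}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical toy corpus: (segmented source sentence, segmented target sentence).
corpus = [
    (["离岸", "外包", "增长"], ["offshore", "outsourcing", "grows"]),
    (["离岸", "账户"], ["offshore", "account"]),
    (["外包", "合同"], ["outsourcing", "contract"]),
]
scores = cooccurrence_scores(corpus)
```

Calling `best_translation(scores, "离岸")` on this toy corpus picks the target word that co-occurs with the source word most consistently. A real system would also need the redundancy filtering described above before trusting the top candidate.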