International business intelligence processing is an important cross-disciplinary research problem in artificial intelligence. Recognizing out-of-vocabulary (OOV) words in international commercial activities, and the OOV phrases derived from them, poses a challenge to widely used machine translation technology: an electronic dictionary with a fixed lexicon cannot keep pace with the rapid growth of OOV phrases in international commerce. In this paper, we present a technique for recognizing and translating OOV phrases in international business intelligence based on a sentence-aligned web corpus. We first obtain the latest and most relevant textual resources from the Internet and build a sentence-aligned corpus. We then use a Markov model to compute the relevancy of adjacent word strings, obtain the maximum-likelihood segmentation, and identify the OOV words and OOV phrases in the business context. Next, we remove redundancy and compute the probabilities and weights of co-occurring word pairs. This yields OOV word pairs and the translations of OOV phrases in business intelligence. Experiments show good results in the international business domain and support timely updates.
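The two statistical steps named above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the threshold value, and the toy data are all assumptions made for illustration. Two simplifications are swapped in and should be read as such: a greedy threshold on first-order Markov transition probabilities stands in for the maximum-likelihood segmentation, and the Dice coefficient stands in for the unspecified co-occurrence weight. The target-side tokens (`tgt_lc` and so on) are hypothetical placeholders for translated units.

```python
from collections import Counter


def bigram_probs(sentences):
    """Estimate first-order Markov transition probabilities P(next | prev)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens[:-1])  # count only tokens that have a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}


def merge_phrases(tokens, probs, threshold=0.8):
    """Greedily join adjacent tokens whose transition probability is high.

    Assumption: a simple threshold replaces maximum-likelihood segmentation.
    """
    if not tokens:
        return []
    phrases = [[tokens[0]]]
    for prev, cur in zip(tokens, tokens[1:]):
        if probs.get((prev, cur), 0.0) >= threshold:
            phrases[-1].append(cur)   # strong link: extend the current phrase
        else:
            phrases.append([cur])     # weak link: start a new phrase
    return [" ".join(p) for p in phrases]


def dice_scores(aligned_pairs):
    """Dice coefficient for each source/target unit pair in aligned sentences.

    Assumption: Dice is one common co-occurrence weight, chosen for brevity.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_units, tgt_units in aligned_pairs:
        src_set, tgt_set = set(src_units), set(tgt_units)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint.update((s, t) for s in src_set for t in tgt_set)
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


# Toy monolingual data (hypothetical) for phrase detection.
corpus = [
    ["letter", "of", "credit", "issued"],
    ["letter", "of", "credit", "expired"],
    ["credit", "note", "sent"],
]
probs = bigram_probs(corpus)
phrases = merge_phrases(["letter", "of", "credit", "issued"], probs)

# Toy sentence-aligned data (hypothetical placeholder target tokens).
aligned = [
    (["letter of credit", "issued"], ["tgt_lc", "tgt_issued"]),
    (["letter of credit", "expired"], ["tgt_lc", "tgt_expired"]),
]
scores = dice_scores(aligned)
best = max(
    (pair for pair in scores if pair[0] == "letter of credit"),
    key=scores.get,
)
print(phrases, best, scores[best])
```

In this toy run the candidate phrase "letter of credit" is paired most strongly with the placeholder `tgt_lc`, because the two always appear together in the aligned sentences. With a real web corpus the counts would be far sparser, so a frequency cutoff or smoothing would likely be needed before trusting either score.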