Cross-Language Sensitive Words Distribution Map: A Novel Recognition-Based Document Understanding Method for Uighur and Tibetan

Cross-language document recognition and understanding address pressing practical needs and have broad application prospects. In this paper, we propose a novel recognition-based method for understanding Uighur and Tibetan documents, termed the "cross-language sensitive words distribution map" (CSWDM). In our unified recognition-understanding framework, digital Uighur/Tibetan document images are first recognized using OCR. CSWDM then labels the sensitive words with their Chinese information, either on the recognized transcriptions or directly on the original digital images, so that the spatial location and occurrence frequency of these words are represented intuitively. With this information, readers can gain a rough understanding of the theme and meaning of the cross-language documents.
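
The labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the OCR stage yields word tokens with bounding boxes and that a sensitive-word lexicon maps source-language words to Chinese labels. The names `Token`, `build_distribution_map`, and all sample words, labels, and coordinates are hypothetical placeholders.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    text: str   # recognized word from the OCR transcription (assumed format)
    box: tuple  # (x, y, width, height) on the page image, in pixels


def build_distribution_map(tokens, lexicon):
    """Group occurrences of sensitive words by their Chinese label.

    Returns {label: {"count": int, "boxes": [box, ...]}}, i.e. the
    occurrence frequency and the locations of each sensitive word.
    """
    dist = defaultdict(lambda: {"count": 0, "boxes": []})
    for tok in tokens:
        label = lexicon.get(tok.text)
        if label is None:
            continue  # not a sensitive word, so nothing to label
        dist[label]["count"] += 1
        dist[label]["boxes"].append(tok.box)
    return dict(dist)


# Placeholder data standing in for real OCR output and a real lexicon.
tokens = [
    Token("src_word_a", (10, 20, 40, 12)),
    Token("src_word_b", (60, 20, 35, 12)),
    Token("src_word_a", (10, 48, 40, 12)),
    Token("src_word_c", (60, 48, 30, 12)),
]
lexicon = {"src_word_a": "zh_label_A", "src_word_b": "zh_label_B"}

dist_map = build_distribution_map(tokens, lexicon)

# List labels from most to least frequent, with their locations.
for label, info in sorted(dist_map.items(), key=lambda kv: -kv[1]["count"]):
    print(label, info["count"], info["boxes"])
```

The stored boxes could then be used to draw the Chinese labels on the transcription or on the original image, while the counts give a quick indication of which sensitive words dominate a page.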
