Stochastic Language Models for Automatic Acquisition of Lexicons from Printed Bilingual Dictionaries

Electronic bilingual lexicons are crucial for machine translation, cross-lingual information retrieval and speech recognition. For low-density languages, however, the availability of electronic bilingual lexicons is questionable. One solution is to acquire electronic lexicons from printed bilingual dictionaries. While manual data entry is a possibility, automatic acquisition of lexicons from scanned images of bilingual dictionaries would expedite the prototyping process of cross-language systems. Printed dictionaries have a logical model that defines the syntax of the dictionary entries – i.e. order of the dictionary entry, its part of speech, its pronunciation and its definition. In this article we propose an algorithm to automatically extract bilingual dictionary entries based on stochastic language models. We demonstrate this algorithm on a printed Chinese-English dictionary. This work can be easily used for extracting information from other tabular structures like telephone books, catalogs, etc.

[1]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[2]  King-Sun Fu,et al.  Syntactic Methods in Pattern Recognition , 1974, IEEE Transactions on Systems, Man, and Cybernetics.

[3]  Anil K. Jain,et al.  Document Representation and Its Application to Page Decomposition , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Daniel X. Le,et al.  Automated labeling in document images , 2000, IS&T/SPIE Electronic Imaging.

[5]  Marcel Worring,et al.  Logical structure detection for heterogeneous document classes , 2000, IS&T/SPIE Electronic Imaging.

[6]  Alfred V. Aho,et al.  The Theory of Parsing, Translation, and Compiling , 1972 .

[7]  Song Mao,et al.  Stochastic language model for analyzing document physical layout , 2001, IS&T/SPIE Electronic Imaging.

[8]  Philip A. Chou,et al.  Document Image Decoding Using Markov Source Models , 1994, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Jiangying Zhou,et al.  Page segmentation and classification , 1992, CVGIP Graph. Model. Image Process..

[10]  Aaron F. Bobick,et al.  Recognition of Visual Activities and Interactions by Stochastic Parsing , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Song Mao,et al.  Empirical Performance Evaluation Methodology and Its Application to Page Segmentation Algorithms , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  Alan Conway,et al.  Page grammars and page parsing. A syntactic approach to document layout recognition , 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[13]  Mahesh Viswanathan,et al.  A prototype document image analysis system for technical journals , 1992, Computer.

[14]  Henry S. Baird,et al.  Image segmentation by shape-directed covers , 1990, [1990] Proceedings. 10th International Conference on Pattern Recognition.

[15]  Andreas Stolcke,et al.  An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities , 1994, CL.

[16]  Matthew Hurst Layout and language: an efficient algorithm for detecting text blocks based on spatial and linguistic evidence , 2000, IS&T/SPIE Electronic Imaging.

[17]  Lawrence O'Gorman,et al.  The Document Spectrum for Page Layout Analysis , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  Lawrence O'Gorman,et al.  Document Image Analysis , 1996 .