Physical and logical structure of printed bilingual dictionary items: Linguistic representation and recognition

Parsing bilingual dictionaries is important for building cross-language retrieval systems and speech recognition algorithms. We describe a general-purpose algorithm that can be easily modified to convert printed bilingual dictionaries in various layouts and language pairs into electronic/symbolic lexicons. In a previous paper [SPIE Document Recognition and Retrieval, San Jose, January 2002], we described an algorithm for segmenting the physical layout of dictionaries into columns and lines. In this paper we assume that the physical lines are given and recognize which lines together constitute a dictionary item. At the same time, the algorithm recognizes the logical structure within each dictionary item (head-word, pronunciation, part of speech and definition). We demonstrate our algorithm on 30 scanned Chinese-English dictionary pages, which include more than 2500 lexicon items.
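
To make the two tasks named above concrete (grouping physical lines into dictionary items, and labelling the logical fields within each item), the following is a minimal illustrative sketch and not the algorithm of this paper. It assumes a hypothetical layout in which each item starts flush left and continuation lines are indented, and a hypothetical field order of head-word, bracketed pronunciation, abbreviated part of speech, then definition. The names `Line`, `Entry`, `group_lines`, `parse_entry` and the `hang_threshold` parameter are all invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    indent: int  # left offset of the physical line (assumed layout cue)


@dataclass
class Entry:
    headword: str
    pronunciation: str
    pos: str
    definition: str


# Assumed field order: headword [pronunciation] pos. definition
ENTRY_RE = re.compile(
    r"^(?P<headword>\S+)\s+"
    r"\[(?P<pron>[^\]]*)\]\s+"
    r"(?P<pos>[a-z]+\.)\s+"
    r"(?P<definition>.+)$"
)


def group_lines(lines, hang_threshold=10):
    """Merge physical lines into items; a flush-left line starts a new item."""
    items, current = [], []
    for line in lines:
        if line.indent < hang_threshold and current:
            items.append(current)
            current = []
        current.append(line.text.strip())
    if current:
        items.append(current)
    return [" ".join(parts) for parts in items]


def parse_entry(text):
    """Split one item into its logical fields, or return None if it does not fit."""
    m = ENTRY_RE.match(text)
    if m is None:
        return None
    return Entry(m["headword"], m["pron"], m["pos"], m["definition"])


# Toy input: two items, the first one wrapped over two physical lines.
page = [
    Line("abandon [uh-ban-duhn] v. to give up", 0),
    Line("completely", 20),
    Line("abate [uh-beyt] v. to lessen", 0),
]
entries = [parse_entry(item) for item in group_lines(page)]
```

Unlike this sketch, which groups lines first and labels fields afterwards, the algorithm described in this paper performs both recognitions simultaneously; real dictionary layouts will also need cues beyond indentation and a fixed pattern.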