PARSING AND TAGGING OF BILINGUAL DICTIONARIES

Abstract: Bilingual dictionaries hold great potential as a source of lexical resources for training and testing automated systems for optical character recognition, machine translation, and cross-language information retrieval. In this paper, we describe a system for extracting term lexicons from printed bilingual dictionaries. Our work is divided into three phases: dictionary segmentation, entry tagging, and lexicon generation. In segmentation, pages are divided into logical entries based on structural features learned from selected examples. The extracted entries are associated with functional labels and passed to a tagging module, which assigns a linguistic label to each word or phrase in the entry. The output of the system is a structure that represents the entries from the dictionary. We have used this approach to parse a variety of dictionaries with both Latin and non-Latin alphabets, and we demonstrate the results of term lexicon generation for retrieval from a collection of French news stories using English queries.
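The three phases named in the abstract can be illustrated with a toy sketch. It is not the system described in the paper: it assumes the page has already been converted to plain text with one entry per line, and it substitutes hand-written rules for the learned structural features and the tagging module. The sample page, the `POS_LABELS` set, and all function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical page text (assumption: OCR is done, one entry per line).
PAGE = """\
chat n. cat
chien n. dog; hound
manger v. to eat
"""

# Hypothetical part-of-speech abbreviations used by this toy dictionary.
POS_LABELS = {"n.", "v.", "adj.", "adv."}


def segment_entries(page_text):
    """Phase 1 (segmentation): split a page into logical entries.

    Rule-based stand-in: every non-empty line is one entry.
    """
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def tag_entry(entry):
    """Phase 2 (tagging): label each word or phrase in an entry.

    Rule-based stand-in: first token is the headword, an optional
    part-of-speech abbreviation follows, and the remainder is a
    semicolon-separated list of translations.
    """
    tokens = entry.split()
    tagged = [("headword", tokens[0])]
    rest = tokens[1:]
    if rest and rest[0] in POS_LABELS:
        tagged.append(("pos", rest[0]))
        rest = rest[1:]
    for phrase in " ".join(rest).split(";"):
        phrase = phrase.strip()
        if phrase:
            tagged.append(("translation", phrase))
    return tagged


def generate_lexicon(tagged_entries):
    """Phase 3 (generation): build a headword -> translations term list."""
    lexicon = defaultdict(list)
    for tagged in tagged_entries:
        headword = next(text for label, text in tagged if label == "headword")
        for label, text in tagged:
            if label == "translation":
                lexicon[headword].append(text)
    return dict(lexicon)


entries = segment_entries(PAGE)
tagged_entries = [tag_entry(entry) for entry in entries]
lexicon = generate_lexicon(tagged_entries)
```

A term list of this shape (source term mapped to candidate translations) is the kind of resource a dictionary-based cross-language retrieval system could use to translate query terms.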
