Information extraction for Thai documents

An increasing amount of electronically available information is stored in Asian-language documents, which makes Information Retrieval (IR) and Information Extraction (IE) for these languages important to a large number of users. Analysing and extracting information in these languages presents several problems not seen in Western European languages; these are interesting both in their own right and for the insights they can give into more general IR and IE techniques. We describe these problems and our system for Thai-language IE. One of the main concerns when working with Thai is that the structure of the language itself is highly ambiguous. The analyser therefore requires more sophisticated techniques and large amounts of domain knowledge to cope with these ambiguities. We describe our approach to a natural language analysis system consisting of a preprocessing module for Thai text and an extraction module that retrieves specific information according to predefined concept definitions.
