To implement even a simple Tibetan Information Retrieval (IR) system, segmentation of one form or another (n-gram, POS-tagging, dictionary substring matching, etc.) must be performed (see Hackett (2000b)). To take Tibetan indexing to a more sophisticated level, however, some form of topic detection must be employed. This paper reports the results of a pilot study on the application to Tibetan of one technique for topic boundary detection: lexical cohesion. The resources developed and deployed, the theoretical model used, and its potential applications are discussed.

Introduction

In a previous paper (Hackett, 2000b) we demonstrated a method for performing word segmentation in conjunction with part-of-speech tagging and sentence boundary detection. While this is sufficient for simple indexing and IR purposes, assessing larger-scale structures within a text allows for more precise searching, disambiguation of translation equivalents based on domain identification, and additional tagging possibilities. This paper reports the results of research deploying “lexical cohesion”, a method used by Kozima (1993) for topic boundary detection, modified for Tibetan. Given the lack of comparable lexical resources for less commonly studied languages like Tibetan, we exploit certain features of classical Tibetan literature, namely the literary genres of monastic textbooks (yig cha) and lists of enumerated phenomena (chos kyi rnam grangs), to build a keyword correlation database for use in computing “Lexical Cohesion Profiles” (LCP) for Tibetan texts.
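The general idea of a keyword correlation database and a cohesion profile can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the binary correlation score, the window size, and the toy data (two well-known enumerations given in Wylie transliteration) are hypothetical. It does not reproduce Kozima's (1993) similarity computation or the database built for this study. The sketch treats two keywords as correlated if they appear in the same enumerated list, averages pairwise correlation over a sliding window of already-segmented words, and reports local minima of the resulting profile as candidate topic boundaries.

```python
from itertools import combinations


def build_correlations(enumerated_lists):
    """Count how often two terms occur together in the same enumerated list."""
    counts = {}
    for terms in enumerated_lists:
        # Sorting gives every pair a canonical (smaller, larger) key.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def correlation(counts, a, b):
    """Toy score (an assumption): 1.0 if the terms share a list, else 0.0."""
    if a == b:
        return 1.0
    key = (a, b) if a < b else (b, a)
    return 1.0 if key in counts else 0.0


def cohesion_profile(words, counts, window=3):
    """Mean pairwise correlation inside each window position (window >= 2)."""
    profile = []
    for start in range(len(words) - window + 1):
        pairs = list(combinations(words[start:start + window], 2))
        total = sum(correlation(counts, a, b) for a, b in pairs)
        profile.append(total / len(pairs))
    return profile


def boundary_candidates(profile):
    """Indices where the profile dips: lower than the left neighbour
    and no higher than the right neighbour."""
    return [
        i for i in range(1, len(profile) - 1)
        if profile[i] < profile[i - 1] and profile[i] <= profile[i + 1]
    ]


# Toy data only: the four truths and the three jewels as two enumerations.
lists = [
    ["sdug bsngal", "kun 'byung", "'gog pa", "lam"],
    ["sangs rgyas", "chos", "dge 'dun"],
]
counts = build_correlations(lists)

# A hypothetical, already-segmented keyword sequence that moves from
# one enumeration to the other.
words = lists[0] + lists[1]
profile = cohesion_profile(words, counts, window=3)
candidates = boundary_candidates(profile)
print(profile)
print(candidates)
```

In this toy run the profile stays high while the window lies inside one enumeration and drops where the window straddles both, which is the behaviour a cohesion profile relies on for boundary detection. A realistic version would need graded correlation scores and the word segmentation step described above as input.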