The Use of yig-cha and chos-kyi-rnam-grangs in Computing Lexical Cohesion for Tibetan Topic Boundary Detection

To properly implement a simple Tibetan Information Retrieval (IR) system, segmentation of one form or another (n-gram, POS-tagging, dictionary substring matching, etc.) must be performed (see Hackett (2000b)). To take Tibetan indexing to a more sophisticated level, however, some form of topic detection must be employed. This paper reports the results of a pilot study on the application to Tibetan of one technique for topic boundary detection: Lexical Cohesion. The resources developed and deployed, the theoretical model used, and its potential applications are discussed.

Introduction

In a previous paper (Hackett, 2000b) we demonstrated a method for performing word segmentation in conjunction with part-of-speech tagging and sentence boundary detection. While that method is sufficient for simple indexing and IR purposes, the assessment of larger-scale structures within a text allows for more precise searching, disambiguation of translation equivalents based on domain identification, and additional tagging possibilities. This paper reports the results of research deploying a method used by Kozima (1993), “lexical cohesion”, for topic boundary detection, modified for Tibetan. Given the lack of comparable lexical resources for less commonly studied languages like Tibetan, we exploit certain features of classical Tibetan literature, namely the literary genres of monastic textbooks (yig cha) and lists of enumerated phenomena (chos kyi rnam grangs), to build a keyword correlation database for use in computing “Lexical Cohesion Profiles” (LCP) for Tibetan texts.
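
The general idea of a keyword correlation database feeding a cohesion profile can be illustrated with a minimal sketch. This is not the model used in this study, nor Kozima's (1993) formulation: it assumes a deliberately crude binary relatedness measure (two keywords are related if they occur in the same enumerated list), and the function names, the window size, and the two toy lists in Wylie transliteration (the three jewels and the four noble truths) are illustrative assumptions only.

```python
from collections import defaultdict
from itertools import combinations


def build_correlations(lists):
    """Count how often two keywords occur in the same enumerated list."""
    counts = defaultdict(int)
    for items in lists:
        # Sorting makes every stored key an ordered pair (a, b) with a < b.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def relatedness(counts, a, b):
    """Assumed toy measure: 1.0 if identical or co-listed, otherwise 0.0."""
    if a == b:
        return 1.0
    key = (a, b) if a < b else (b, a)
    return 1.0 if counts.get(key, 0) > 0 else 0.0


def cohesion_profile(words, counts, window=3):
    """Mean pairwise relatedness inside each sliding window of segmented words."""
    if window < 2:
        raise ValueError("window must contain at least two words")
    profile = []
    for start in range(len(words) - window + 1):
        pairs = list(combinations(words[start:start + window], 2))
        score = sum(relatedness(counts, a, b) for a, b in pairs) / len(pairs)
        profile.append(score)
    return profile


# Toy stand-in for a chos kyi rnam grangs source (illustrative data only).
enumerations = [
    ["sangs rgyas", "chos", "dge 'dun"],              # dkon mchog gsum
    ["sdug bsngal", "kun 'byung", "'gog pa", "lam"],  # bden pa bzhi
]
counts = build_correlations(enumerations)

# A pre-segmented word sequence that moves from one topic to the other.
words = ["sangs rgyas", "chos", "dge 'dun",
         "sdug bsngal", "kun 'byung", "'gog pa", "lam"]
profile = cohesion_profile(words, counts, window=3)
```

In this sketch the profile stays high while a window lies inside one enumeration and dips where a window straddles both, so a local minimum is a candidate topic boundary. A realistic system would need graded correlation weights and a far larger database than this toy example suggests.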