An NLP & IR approach to topic detection

This paper presents algorithms for Chinese and English-Chinese topic detection. Named entities, other nouns, and verbs serve as cue patterns for relating news stories that describe the same event. Lexical translation and name transliteration resolve lexical differences between English and Chinese. A two-threshold scheme determines whether a news story is relevant or irrelevant to a topic cluster, and lookahead information handles the ambiguous cases in clustering. A least-recently-used removal strategy models the time factor so that older and unimportant terms have no effect on clustering. Experimental results show that the model combining nouns and verbs with the least-recently-used removal strategy outperforms the other models. The named-entity-only approach performs slightly worse, but it avoids the overhead of the nouns-and-verbs approach with the least-recently-used removal strategy.
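The two-threshold decision and the least-recently-used removal of terms can be illustrated with a minimal sketch. It is not the paper's implementation: the names `classify` and `LRUTermCentroid`, the threshold values, and the fixed term capacity are all assumptions made for illustration, and the sketch models recency only, not term importance. The abstract does not specify the similarity measure or how lookahead resolves ambiguous stories, so the sketch simply labels them.

```python
from collections import OrderedDict


def classify(similarity, low, high):
    """Two-threshold decision (hypothetical helper).

    A story at or above `high` is relevant to the cluster, a story below
    `low` is irrelevant, and anything in between is ambiguous and left
    for a later step (lookahead, in the paper's terms).
    """
    if similarity >= high:
        return "relevant"
    if similarity < low:
        return "irrelevant"
    return "ambiguous"


class LRUTermCentroid:
    """Term weights of one topic cluster with a fixed capacity (assumed).

    When the capacity is exceeded, the least recently used terms are
    removed, so terms that stopped appearing no longer affect clustering.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.terms = OrderedDict()  # term -> weight, least recent first

    def update(self, story_terms):
        # Add the story's weights and mark each term as most recently used.
        for term, weight in story_terms.items():
            self.terms[term] = self.terms.get(term, 0.0) + weight
            self.terms.move_to_end(term)
        # Evict least recently used terms until the capacity holds.
        while len(self.terms) > self.capacity:
            self.terms.popitem(last=False)


centroid = LRUTermCentroid(capacity=2)
centroid.update({"earthquake": 1.0, "taipei": 1.0})
centroid.update({"earthquake": 1.0, "rescue": 1.0})
remaining = list(centroid.terms)
```

In this example the term `taipei` is evicted because it was not seen in the second story, while `earthquake` accumulates weight across both stories.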
