Paper Information - CLTC: A Chinese-English Cross-lingual Topic Corpus

CLTC: A Chinese-English Cross-lingual Topic Corpus

Cross-lingual topic detection in text is a feasible way to overcome the language barrier in information access. This paper presents a Chinese-English cross-lingual topic corpus (CLTC), in which 90,000 Chinese articles and 90,000 English articles are organized into 150 topics. Compared with the TDT corpora, CLTC has three advantages. First, CLTC is larger, which makes it possible to evaluate large-scale cross-lingual text clustering methods. Second, articles are evenly distributed across the topics, so the corpus can be used to produce test datasets for different purposes. Third, CLTC can serve as a cross-lingual comparable corpus for developing cross-lingual information access methods. A preliminary evaluation indicates that CLTC is effective for evaluating cross-lingual topic detection methods.

Peng Jin | Yunqing Xia | Xia Yang | Guoyu Tang
