Cross-language article linking with different knowledge bases using bilingual topic model and translation features

Linking online encyclopedia articles across languages is crucial for constructing and integrating large multilingual knowledge bases. Most research to date has focused on linking different language versions of Wikipedia, yet other large online encyclopedias exist in a variety of languages. In this work, we present a cross-language article-linking method that uses bilingual topic model and translation features in an SVM model to link articles in English Wikipedia and Chinese Baidu Baike, the most widely used wiki-like encyclopedia in China. To evaluate our approach, we compile data sets from Baidu Baike articles and their corresponding English Wikipedia articles. The evaluation results show that our approach achieves a mean reciprocal rank (MRR) of up to 0.8158, outperforming the baseline system by 0.1328 (+19.44%). Our method does not depend heavily on language-specific characteristics, and it can be easily extended to generate cross-language article links among online encyclopedias in other languages.
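As a reading aid for the reported metric, the following is a minimal sketch of how MRR and the relative gain quoted above could be computed. The function name, the toy rankings, and the gold labels are illustrative assumptions, not the authors' evaluation code or data; only the figures 0.8158 and 0.1328 come from the abstract.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    gold: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the correct article over all queries.

    A query whose correct article is missing from its ranking contributes 0.
    """
    if len(rankings) != len(gold):
        raise ValueError("rankings and gold must have the same length")
    if not rankings:
        return 0.0
    total = 0.0
    for candidates, target in zip(rankings, gold):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical toy data: ranked English candidates for three source articles.
rankings = [
    ["Apple Inc.", "Apple"],        # correct answer at rank 1
    ["Paris, Texas", "Paris"],      # correct answer at rank 2
    ["Jaguar Cars", "Jaguar"],      # correct answer not retrieved
]
gold = ["Apple Inc.", "Paris", "Panthera onca"]
mrr = mean_reciprocal_rank(rankings, gold)  # (1 + 1/2 + 0) / 3

# Relative gain implied by the abstract's figures.
best_mrr = 0.8158
absolute_gain = 0.1328
baseline_mrr = best_mrr - absolute_gain       # 0.6830
relative_gain = absolute_gain / baseline_mrr  # about 0.1944, i.e. +19.44%
```

Under these figures the baseline MRR works out to 0.6830, which is consistent with the stated +19.44% relative improvement.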
