An Automatic Multi-domain Thesauri Construction Method Based on LDA

This paper proposes a method, based on Latent Dirichlet Allocation (LDA), for automatically building domain-specific thesauri from plain-text corpora. The method consists of two steps: 1) discovering domain-specific terms from document collections covering multiple domains, and 2) learning hierarchical relations between the terms associated with each domain. The novelty of step 1 lies in using LDA to select terms that have a high predictive probability for a specific domain via latent topics, which overcomes the drawbacks of the unigram model. In step 2, the hierarchical relations among domain terms are learned by a novel approach based on word association analysis. The proposed method is tested on two datasets in different languages. The experimental results show that the terms obtained by this method are intuitively relevant to the reference domain, that many term pairs with hierarchical relations are discovered, and that these relations reflect the structure of the domain rather well. Compared to other approaches, the proposed method is more accurate in both the domain term mining task and the hierarchical relation learning task.
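To make step 1 more concrete, the following is a minimal sketch of one way to score terms by their predictive probability for a domain via latent topics. It is an illustration under stated assumptions, not the paper's actual formulation: the function names `term_domain_scores` and `top_terms`, the toy probabilities, the uniform domain prior, and the scoring rule P(domain | term) ∝ P(domain) · Σ_z P(term | z) · P(z | domain) are all hypothetical. In practice the topic-word and domain-topic distributions would be estimated by a fitted LDA model.

```python
# Hypothetical toy inputs standing in for a fitted LDA model.
# p_word_given_topic[z][w] = P(w | z); p_topic_given_domain[c][z] = P(z | c).
p_word_given_topic = [
    {"gene": 0.6, "cell": 0.3, "the": 0.1},    # topic 0
    {"stock": 0.6, "bank": 0.3, "the": 0.1},   # topic 1
]
p_topic_given_domain = {
    "biology": [0.9, 0.1],
    "finance": [0.1, 0.9],
}


def term_domain_scores(p_word_given_topic, p_topic_given_domain, domain_prior=None):
    """Return P(domain | term) for every term, marginalizing over latent topics.

    Assumed scoring rule (not taken from the paper):
        P(c | w) is proportional to P(c) * sum_z P(w | z) * P(z | c)
    """
    domains = list(p_topic_given_domain)
    if domain_prior is None:
        # Assumption: domains are equally likely a priori.
        domain_prior = {c: 1.0 / len(domains) for c in domains}

    vocab = sorted({w for dist in p_word_given_topic for w in dist})
    n_topics = len(p_word_given_topic)

    scores = {}
    for w in vocab:
        joint = {}
        for c in domains:
            # P(w | c) obtained by summing over the latent topics.
            p_w_given_c = sum(
                p_word_given_topic[z].get(w, 0.0) * p_topic_given_domain[c][z]
                for z in range(n_topics)
            )
            joint[c] = domain_prior[c] * p_w_given_c
        total = sum(joint.values())
        scores[w] = {c: (joint[c] / total if total > 0 else 0.0) for c in domains}
    return scores


def top_terms(scores, domain, k):
    """Return the k terms with the highest P(domain | term)."""
    return sorted(scores, key=lambda w: scores[w][domain], reverse=True)[:k]


scores = term_domain_scores(p_word_given_topic, p_topic_given_domain)
biology_terms = top_terms(scores, "biology", 2)
```

In this toy setting a word such as "the", which is equally likely under both topics, is equally predictive of both domains and so ranks below the topic-specific words. Scoring through shared topics rather than raw per-domain unigram counts is one plausible reading of how such a method can separate domain terms from frequent general vocabulary.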
