Dragon Toolkit: Incorporating Auto-Learned Semantic Knowledge into Large-Scale Text Retrieval and Mining

The majority of text retrieval and mining techniques are still based on exact feature (e.g. words) matching and unable to incorporate text semantics. Many researchers believe that the extension with semantic knowledge could improve the results and various methods (most of them are heuristic) have been proposed to account for concept hierarchy, synonymy, and other semantic relationships. However, the results with such semantic extension have been mixed, ranging from slight improvements to decreases in effectiveness, mostly likely due to the lack of a formal framework. Instead, we propose a novel method to address the semantic extension within the framework of language modeling. Our method extracts explicit topic signatures from documents and then statistically maps them into single- word features. The incorporation of semantic knowledge then reduces to the smoothing of unigram language models using semantic knowledge. The dragon toolkit reflects our method and its effectiveness is demonstrated by three tasks, text retrieval, text classification, and text clustering.

[1]  Thomas Hofmann,et al.  Text categorization by boosting automatically extracted concepts , 2003, SIGIR.

[2]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[3]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[4]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[5]  J. Cleary,et al.  \self-organized Language Modeling for Speech Recognition". In , 1997 .

[6]  John D. Lafferty,et al.  A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval , 2017, SIGF.

[7]  Clement T. Yu,et al.  Word sense disambiguation in queries , 2005, CIKM '05.

[8]  W. Bruce Croft,et al.  LDA-based document models for ad-hoc retrieval , 2006, SIGIR.

[9]  Stephen E. Robertson,et al.  Okapi at TREC-4 , 1995, TREC.

[10]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[11]  John D. Lafferty,et al.  Two-stage language models for information retrieval , 2002, SIGIR '02.

[12]  John D. Lafferty,et al.  Information Retrieval as Statistical Translation , 2017 .

[13]  Yoram Singer,et al.  Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[14]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[15]  Michael K. Ng,et al.  A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering , 2007, DASFAA.

[16]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Indexing , 1999, SIGIR Forum.

[17]  Joydeep Ghosh,et al.  Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[18]  Xiaohua Hu,et al.  MaxMatcher: Biological Concept Extraction Using Approximate Dictionary Lookup , 2006, PRICAI.

[19]  Joydeep Ghosh,et al.  Under Consideration for Publication in Knowledge and Information Systems Generative Model-based Document Clustering: a Comparative Study , 2003 .

[20]  Frank Smadja,et al.  Retrieving Collocations from Text: Xtract , 1993, CL.

[21]  Mark Sanderson,et al.  Word sense disambiguation and information retrieval , 1994, SIGIR '94.