Multi-Level Topical Text Categorization with Wikipedia

This paper introduces an automatic category-labeling model for text categorization. Traditional classification algorithms generally rely on labeled training sets and require substantial manual work to tag documents beforehand. Moreover, because texts are ambiguous and fuzzy, the results of traditional text categorization algorithms may be neither clear nor rich in content. This paper presents an unsupervised, training-set-free, hierarchical categorization model called Folk-Topical Text Categorization (FTTC). FTTC applies a topic model to abstract documents into topical words and makes use of Wikipedia's crowd-sourcing and collective control to extend them into hierarchical classifications. The results are not restricted to predefined categories; they also contain categories abstracted to deeper semantic levels, which facilitates traditional text categorization applications. For a document, its topical words are obtained using a popular topic model, Latent Dirichlet Allocation (LDA). The topical words are then used to build and traverse the category trees of Wikipedia. After filtering, the final classifications reflect the diverse, content-rich information in the text and cover its different aspects. Experimental results on different kinds of datasets show that our model improves on traditional models in classification accuracy, flexibility and intelligibility.
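The category-tree step can be illustrated with a minimal sketch. It is not the FTTC implementation: it assumes the topical words have already been produced by LDA, and it replaces Wikipedia with a small hand-written category graph. The names `PARENTS`, `ancestors`, `categorize`, `max_depth` and `min_support` are hypothetical, and the support-count filter is only one plausible way to filter the traced categories.

```python
from collections import Counter, deque

# Hypothetical toy fragment of a category graph (child -> parent categories).
# A real system would read these links from Wikipedia's category structure.
PARENTS = {
    "goalkeeper": ["Association football positions"],
    "penalty": ["Association football terminology"],
    "striker": ["Association football positions"],
    "Association football positions": ["Association football"],
    "Association football terminology": ["Association football"],
    "Association football": ["Ball games"],
    "Ball games": ["Sports"],
}


def ancestors(term, max_depth):
    """Return {category: depth} for categories reachable from term
    by following parent links for at most max_depth levels."""
    seen = {}
    queue = deque([(term, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb past the depth limit
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen[parent] = depth + 1
                queue.append((parent, depth + 1))
    return seen


def categorize(topical_words, max_depth=3, min_support=2):
    """Count how many topical words reach each category, drop categories
    with too little support, and rank the rest by support, then name."""
    support = Counter()
    for word in topical_words:
        for category in ancestors(word, max_depth):
            support[category] += 1
    kept = [(c, n) for c, n in support.items() if n >= min_support]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Topical words are assumed to come from an LDA model run on the document.
result = categorize(["goalkeeper", "penalty", "striker"])
for category, count in result:
    print(category, count)
```

Categories shared by several topical words rise to the top, while a category reached by only one word is filtered out. Raising `max_depth` yields more abstract categories at the cost of specificity.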
