Learning to classify short text from scientific documents using topic models with various types of knowledge

An efficient framework to classify short text from scientific documents is proposed.Topic models from various types of knowledge were used for enhancing features in documents.Two methods were presented to optimize external features that enhance relatedness in documents.The performances were evaluated by using real-world scientific documents from online publisher.Proposed methods are shown to outperform related work. Classification of short text is challenging due to data sparseness, which is a typical characteristic of short text. In this paper, we propose methods for enhancing features using topic models, which make short text seem less sparse and more topic-oriented for classification. We exploited topic model analysis based on Latent Dirichlet Allocation for enriched datasets, and then we presented new methods for enhancing features by combining external texts from topic models that make documents more effective for classification. In experiments, we utilized the title contents of scientific articles as short text documents, and then enriched these documents using topic models from various types of universal datasets for classification in order to show that our approach performs efficiently.

[1]  Christopher Meek,et al.  Improving Similarity Measures for Short Segments of Text , 2007, AAAI.

[2]  Susumu Horiguchi,et al.  A Hidden Topic-Based Framework toward Building Applications with Short Web Documents , 2011, IEEE Transactions on Knowledge and Data Engineering.

[3]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[4]  Jean-Michel Renders,et al.  Word-Sequence Kernels , 2003, J. Mach. Learn. Res..

[5]  Gregor Heinrich Parameter estimation for text analysis , 2009 .

[6]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[7]  John Shepherd,et al.  Document Classification via Structure Synopses , 2003, ADC.

[8]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[9]  Somnath Banerjee,et al.  Clustering short texts using wikipedia , 2007, SIGIR.

[10]  Efstathios Stamatatos,et al.  Automatic Text Categorization In Terms Of Genre and Author , 2000, CL.

[11]  Edward A. Fox,et al.  Intelligent GP fusion from multiple sources for text classification , 2005, CIKM '05.

[12]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[13]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[14]  Andrew McCallum,et al.  Using Maximum Entropy for Text Classification , 1999 .

[15]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[16]  Katrina Fenlon,et al.  Improving retrieval of short texts through document expansion , 2012, SIGIR '12.

[17]  Xijin Tang,et al.  Text classification based on multi-word with support vector machine , 2008, Knowl. Based Syst..

[18]  Danushka Bollegala,et al.  Measuring semantic similarity between words using web search engines , 2007, WWW '07.

[19]  Raffaella Bernardi,et al.  Query classification via Topic Models for an art image archive , 2011 .

[20]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[21]  Catherine Havasi,et al.  ConceptNet 5: A Large Semantic Network for Relational Knowledge , 2013, The People's Web Meets NLP.

[22]  Gang Liu,et al.  Short text similarity based on probabilistic topics , 2009, Knowledge and Information Systems.

[23]  Michael W. Berry,et al.  Large-Scale Information Retrieval with Latent Semantic Indexing , 1997, Inf. Sci..

[24]  Oren Etzioni,et al.  A Latent Dirichlet Allocation Method for Selectional Preferences , 2010, ACL.

[25]  William John Teahan,et al.  Context-based methods for text categorisation , 2004, SIGIR '04.

[26]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Indexing , 1999, SIGIR Forum.

[27]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Chengqi Zhang,et al.  TCSST: transfer classification of short & sparse text using external data , 2012, CIKM.

[29]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[30]  Susan T. Dumais,et al.  Similarity Measures for Short Segments of Text , 2007, ECIR.

[31]  Yiming Yang,et al.  An example-based mapping method for text categorization and retrieval , 1994, TOIS.

[32]  Wei-Ying Ma,et al.  Improving text classification using local latent semantic indexing , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[33]  Aixin Sun,et al.  Short text classification using very few words , 2012, SIGIR '12.

[34]  Thomas Hofmann,et al.  Text categorization by boosting automatically extracted concepts , 2003, SIGIR.

[35]  Mengen Chen,et al.  Short Text Classification Improved by Learning Multi-Granularity Topics , 2011, IJCAI.

[36]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[37]  Bing Liu,et al.  Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data , 2006, Data-Centric Systems and Applications.

[38]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.