Improving Open Directory Project-Based Text Classification with Hierarchical Category Embedding

Many works have used knowledge bases that contain a taxonomy of hierarchically structured categories for large-scale text classification. These works rely on the explicit representation model to exploit the hierarchical taxonomy, and they have demonstrated that this model provides stable performance for large-scale text classification. However, that performance is bounded by what the knowledge base itself contains. In this paper, we integrate the implicit representation model, which can use external knowledge indirectly, into existing large-scale text classification. To this end, we first propose Hierarchical Category embedding (HC embedding), which generates distributed representations of hierarchical categories based on the implicit representation model. Second, we develop a new semantic similarity method that integrates HC embedding with large-scale text classification. To demonstrate its efficacy, we apply the proposed methodology to Open Directory Project (ODP)-based text classification, which uses a hierarchical taxonomy. The evaluation results show that the proposed method outperforms the current state-of-the-art method by 7.4%, 7.0%, and 18% in terms of micro-averaged F1-score, macro-averaged F1-score, and precision at k, respectively.
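
As a rough illustration of the general idea of classifying text by semantic similarity to category embeddings, the following minimal sketch ranks categories by cosine similarity to a document vector and scores the ranking with precision at k. It is not the HC embedding or the similarity method proposed in the paper: the toy vectors in `WORD_VECS` and `CATEGORY_VECS`, the averaging of word vectors, and the helper names `rank_categories` and `precision_at_k` are all assumptions made for illustration.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this sketch only.
WORD_VECS = {
    "guitar": [0.9, 0.1, 0.0],
    "chord": [0.8, 0.2, 0.1],
    "goal": [0.1, 0.9, 0.0],
    "league": [0.0, 0.8, 0.2],
}
# Hypothetical category vectors keyed by ODP-style category paths.
CATEGORY_VECS = {
    "Top/Arts/Music": [1.0, 0.1, 0.0],
    "Top/Sports/Soccer": [0.1, 1.0, 0.1],
    "Top/Science/Math": [0.0, 0.1, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_categories(text, k=2):
    """Return the k categories most similar to the averaged word vectors."""
    tokens = [t for t in text.lower().split() if t in WORD_VECS]
    if not tokens:
        return []  # no known words, so nothing to compare
    doc_vec = mean_vector([WORD_VECS[t] for t in tokens])
    scored = [(cosine(doc_vec, vec), cat) for cat, vec in CATEGORY_VECS.items()]
    scored.sort(reverse=True)
    return [cat for _, cat in scored[:k]]


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked categories that are in the relevant set."""
    if k <= 0:
        return 0.0
    return sum(1 for cat in ranked[:k] if cat in relevant) / k
```

For example, `rank_categories("guitar chord")` places `Top/Arts/Music` first under these toy vectors, and `precision_at_k` of that ranking against the single relevant category `Top/Arts/Music` is 1.0 at k = 1 and 0.5 at k = 2.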