HierCon: Hierarchical Organization of Technical Documents Based on Concepts

In this work, we study the hierarchical organization of technical documents: given a set of documents and a hierarchy of categories, the goal is to assign each document to its corresponding categories. Prior work on supervised hierarchical document categorization relies on large amounts of labeled training data, which are expensive to obtain in closed technical domains and tend to become stale as new knowledge emerges. We instead study this problem in a weak supervision setting, leveraging semantic information from concepts. The core idea is to project both documents and categories into a common concept embedding space, where their fine-grained similarity can be computed easily and effectively. Experiments on real-world datasets from computer science, physics and mathematics, and medicine demonstrate the superior performance of our approach over a wide range of state-of-the-art baselines.
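
To make the core idea concrete, the following is a minimal sketch of assigning a document to a category path by similarity in a shared concept embedding space. It is an illustration under stated assumptions, not the method described in the paper: the abstract does not say how document or category vectors are built or how the hierarchy is traversed, so averaging concept vectors, using cosine similarity, and descending the hierarchy greedily are all assumptions here. The function names, the toy concept vectors, and the toy hierarchy are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def embed(concepts, concept_vecs):
    """Assumption: represent a document or category as the mean of its concept vectors."""
    return mean_vector([concept_vecs[c] for c in concepts])

def assign(doc_concepts, hierarchy, category_concepts, concept_vecs, root="root"):
    """Greedy top-down assignment: at each level pick the most similar child category."""
    doc_vec = embed(doc_concepts, concept_vecs)
    path, node = [], root
    while hierarchy.get(node):  # stop at a leaf (no children listed)
        node = max(
            hierarchy[node],
            key=lambda child: cosine(doc_vec, embed(category_concepts[child], concept_vecs)),
        )
        path.append(node)
    return path

# Hypothetical toy data: 3-dimensional concept embeddings.
concept_vecs = {
    "algorithm": [1.0, 0.0, 0.0],
    "neural network": [0.9, 0.0, 0.4],
    "sql query": [0.9, 0.0, -0.4],
    "quantum field": [0.0, 1.0, 0.0],
    "gradient descent": [0.8, 0.1, 0.5],
}
hierarchy = {
    "root": ["computer science", "physics"],
    "computer science": ["machine learning", "databases"],
}
category_concepts = {
    "computer science": ["algorithm"],
    "physics": ["quantum field"],
    "machine learning": ["neural network"],
    "databases": ["sql query"],
}

path = assign(["gradient descent", "neural network"], hierarchy, category_concepts, concept_vecs)
print(path)
```

No labeled documents are used in this sketch: the only supervision is the set of concepts attached to each category, which is one way to read the weak supervision setting. A greedy descent can propagate an early mistake down the hierarchy, so a fuller implementation might score whole paths instead.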
