Constructing and Mining Heterogeneous Information Networks from Massive Text

Real-world data exists largely in the form of unstructured text. A grand challenge in data mining research is to develop effective and scalable methods that can transform unstructured text into structured knowledge. In our view, it is highly beneficial to transform such text into structured heterogeneous information networks, from which actionable knowledge can be generated according to the user's needs. In this tutorial, we provide a comprehensive overview of recent research and development in this direction. First, we introduce a series of effective methods that construct heterogeneous information networks from massive, domain-specific text corpora. Then we discuss methods that mine such text-rich networks according to the user's needs. Specifically, we focus on scalable, effective, weakly supervised, language-agnostic methods that work on various kinds of text. We further demonstrate, on real datasets (including news articles, scientific publications, and product reviews), how information networks can be constructed and how they can assist further exploratory analysis.
