World Knowledge as Indirect Supervision for Document Clustering

One of the key obstacles to making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider a framework that uses world knowledge as indirect supervision. World knowledge is general-purpose knowledge that is not designed for any specific domain. The key challenges are therefore how to adapt world knowledge to domains and how to represent it for learning. In this article, we provide an example of using world knowledge for domain-dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and we represent the data together with the world knowledge as a heterogeneous information network. We then propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base that is automatically extracted from Wikipedia and maps knowledge to the linguistic knowledge base WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.

A preliminary version of this work appeared in the proceedings of KDD 2015 [Wang et al. 2015a]. This journal version makes several major improvements. First, we propose a new and general learning framework for machine learning with world knowledge as indirect supervision, of which the document clustering in the original paper is a special case. Second, to make our unsupervised semantic parsing method more understandable, we add several real cases that trace original sentences to the resulting logic forms with all the necessary information. Third, we add details of the three semantic filtering methods and analyze them in depth, using case studies to show why the conceptualization-based semantic filter can produce more accurate indirect supervision. Finally, in addition to the experiment on 20newsgroups data and Freebase, we extend the clustering experiments to all combinations of text datasets (20newsgroups, MCAT, CCAT, ECAT) and world knowledge sources (Freebase, YAGO2).
