KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks

As a fundamental task, document similarity measure has broad impact to document-based classification, clustering and ranking. Traditional approaches represent documents as bag-of-words and compute document similarities using measures like cosine, Jaccard, and dice. However, entity phrases rather than single words in documents can be critical for evaluating document relatedness. Moreover, types of entities and links between entities/words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), where the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem to a graph distance problem. Intuitively, there could be multiple paths between a pair of documents. We propose to use the meta-path defined in HIN to compute distance between documents. Instead of burdening user to define meaningful meta paths, an automatic method is proposed to rank the meta-paths. Given the meta-paths associated with ranking scores, an HIN-based similarity measure, KnowSim, is proposed to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HIN for documents, our experiments on 20Newsgroups and RCV1 datasets show that KnowSim generates impressive high-quality document clustering.

[1]  Ken Lang,et al.  NewsWeeder: Learning to Filter Netnews , 1995, ICML.

[2]  Daphne Koller,et al.  Using machine learning to improve information access , 1998 .

[3]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[4]  Steffen Staab,et al.  Ontologies improve text document clustering , 2003, Third IEEE International Conference on Data Mining.

[5]  Pradeep Ravikumar,et al.  A Comparison of String Distance Metrics for Name-Matching Tasks , 2003, IIWeb.

[6]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[7]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[8]  Pietro Perona,et al.  Self-Tuning Spectral Clustering , 2004, NIPS.

[9]  Doug Downey,et al.  Web-scale information extraction in knowitall: (preliminary results) , 2004, WWW '04.

[10]  Xiaojun Wan,et al.  The earth mover's distance as a semantic measure for document similarity , 2005, CIKM '05.

[11]  Evgeniy Gabrilovich,et al.  Feature Generation for Text Categorization Using World Knowledge , 2005, IJCAI.

[12]  Deng Cai,et al.  Laplacian Score for Feature Selection , 2005, NIPS.

[13]  Graeme Hirst,et al.  Evaluating WordNet-based Measures of Lexical Semantic Relatedness , 2006, CL.

[14]  Fan Chung Graham,et al.  Local Graph Partitioning using PageRank Vectors , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[15]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[16]  Simone Paolo Ponzetto,et al.  Deriving a Large-Scale Taxonomy from Wikipedia , 2007, AAAI.

[17]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[18]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.

[19]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[20]  Oren Etzioni,et al.  Open Information Extraction from the Web , 2007, CACM.

[21]  Hua Li,et al.  Enhancing text clustering by leveraging Wikipedia semantics , 2008, SIGIR '08.

[22]  Nan Sun,et al.  Exploiting internal and external semantics for the clustering of short texts using world knowledge , 2009, CIKM.

[23]  Shimei Pan,et al.  Topic and keyword re-ranking for LDA-based topic modeling , 2009, CIKM.

[24]  Xiaohua Hu,et al.  Exploiting Wikipedia as external knowledge for document clustering , 2009, KDD.

[25]  Ni Lao,et al.  Relational retrieval using a combination of path-constrained random walks , 2010, Machine Learning.

[26]  Joan A. Smith,et al.  Robust , Light-weight Approaches to compute Lexical Similarity , 2010 .

[27]  Philip S. Yu,et al.  Mining knowledge from databases: an information network analysis approach , 2010, SIGMOD Conference.

[28]  Tom M. Mitchell,et al.  Random Walk Inference and Learning in A Large Scale Knowledge Base , 2011, EMNLP.

[29]  Haixun Wang,et al.  Short Text Conceptualization Using a Probabilistic Knowledgebase , 2011, IJCAI.

[30]  Philip S. Yu,et al.  PathSim , 2011, Proc. VLDB Endow..

[31]  Yizhou Sun,et al.  Mining Heterogeneous Information Networks: Principles and Methodologies , 2012, Mining Heterogeneous Information Networks: Principles and Methodologies.

[32]  Philip S. Yu,et al.  Integrating meta-path selection with user-guided object clustering in heterogeneous information networks , 2012, KDD.

[33]  Ming Zhou,et al.  Paraphrasing Adaptation for Web Search Ranking , 2013, ACL.

[34]  Wei Zhang,et al.  Knowledge vault: a web-scale approach to probabilistic knowledge fusion , 2014, KDD.

[35]  Dan Roth,et al.  On Dataless Hierarchical Text Classification , 2014, AAAI.

[36]  Dan Roth,et al.  Unsupervised Sparse Vector Densification for Short Text Similarity , 2015, NAACL.

[37]  Haixun Wang,et al.  Open Domain Short Text Conceptualization: A Generative + Descriptive Modeling Approach , 2015, IJCAI.

[38]  Dan Roth,et al.  Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks , 2015, KDD.

[39]  Xinlei Chen,et al.  Never-Ending Learning , 2012, ECAI.