Measuring Similarity of Academic Articles with Semantic Profile and Joint Word Embedding

Long-document semantic measurement is important in many applications such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts. Document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transition. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements such as the research purpose, methodology, and domain are included and enriched. We can then obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured with word vectors. To improve the semantic representation quality of the word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that, in the measurement of document semantic similarity, our approach achieves substantial improvement over state-of-the-art methods, and that our joint word-embedding model produces significantly better word representations than traditional word-embedding models.
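The profile comparison described above can be illustrated with a minimal sketch. It is not the paper's actual procedure: the field names, the toy three-dimensional vectors, the best-match averaging, and the helper names `cosine`, `field_similarity`, and `profile_similarity` are all assumptions made for illustration. The abstract states only that profiles hold elements such as purpose, methodology, and domain, and that distances between concepts are measured with word vectors.

```python
import math

# Toy 3-dimensional word vectors; illustrative values only, not trained embeddings.
TOY_VECTORS = {
    "similarity": [0.9, 0.1, 0.0],
    "matching": [0.8, 0.2, 0.1],
    "embedding": [0.1, 0.9, 0.1],
    "vectors": [0.2, 0.8, 0.0],
    "biology": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def field_similarity(concepts_a, concepts_b, vectors):
    """Average, over concepts in A, of the best cosine match among concepts in B.

    This aggregation is a hypothetical choice; the source does not specify one.
    """
    known_a = [c for c in concepts_a if c in vectors]
    known_b = [c for c in concepts_b if c in vectors]
    if not known_a or not known_b:
        return 0.0
    best = [max(cosine(vectors[a], vectors[b]) for b in known_b) for a in known_a]
    return sum(best) / len(best)


def profile_similarity(profile_a, profile_b, vectors):
    """Mean of the symmetrised field similarities over fields both profiles share."""
    shared = [field for field in profile_a if field in profile_b]
    if not shared:
        return 0.0
    scores = []
    for field in shared:
        a_to_b = field_similarity(profile_a[field], profile_b[field], vectors)
        b_to_a = field_similarity(profile_b[field], profile_a[field], vectors)
        scores.append((a_to_b + b_to_a) / 2)
    return sum(scores) / len(scores)


# Hypothetical semantic profiles: field name -> list of concept words.
paper_1 = {"purpose": ["similarity"], "methodology": ["embedding"]}
paper_2 = {"purpose": ["matching"], "methodology": ["vectors"]}
paper_3 = {"purpose": ["biology"], "methodology": ["biology"]}

print(profile_similarity(paper_1, paper_2, TOY_VECTORS))
print(profile_similarity(paper_1, paper_3, TOY_VECTORS))
```

With these toy vectors, the first pair of profiles scores higher than the second, because their concepts point in similar directions. In practice the vectors would come from a trained embedding model, and the fields might be weighted rather than averaged equally.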
