Using Semantic Perimeters with Ontologies to Evaluate the Semantic Similarity of Scientific Papers

This paper addresses the use of ontologies to compare scientific texts. It focuses on scientific papers, and specifically on their abstracts: short, relatively well-structured texts that normally provide enough knowledge for a community of readers to assess the content of the associated papers. The problem is therefore to assess the semantic proximity, or similarity, of two papers by examining their respective abstracts. Because a domain ontology provides a useful way to represent the knowledge of a given domain, this work relies on ontologies of scientific domains. Our process begins with an automatic classification step that associates each abstract with its relevant scientific domain, chosen from several candidate domains. The content of the abstract is then represented as a conceptual graph, which is enriched to construct its semantic perimeter. This notion of semantic perimeter allows us to assess the similarity between texts by matching their graphs. Among the many possible application fields of our approach, plagiarism detection is the main one addressed in this paper.

Summary: This paper deals with the use of ontologies to compare scientific texts. Plagiarism detection is the main application area addressed, among the many possible application areas of our approach.
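To make the idea of graph enrichment and matching more concrete, the following is a minimal sketch and not the method of the paper. The abstract does not define how the semantic perimeter is built, so the sketch assumes it is the set of ontology concepts reachable within a fixed number of links from the concepts found in an abstract, and it assumes Jaccard overlap as the matching score. The toy ontology and the names `TOY_ONTOLOGY`, `semantic_perimeter`, `perimeter_similarity`, and `radius` are all hypothetical.

```python
from collections import deque

# Hypothetical toy ontology (an assumption for illustration only):
# each concept maps to the concepts it is directly linked to.
TOY_ONTOLOGY = {
    "ontology": {"knowledge_representation", "taxonomy"},
    "knowledge_representation": {"ontology", "conceptual_graph"},
    "taxonomy": {"ontology"},
    "conceptual_graph": {"knowledge_representation", "graph_matching"},
    "graph_matching": {"conceptual_graph", "plagiarism_detection"},
    "plagiarism_detection": {"graph_matching", "text_similarity"},
    "text_similarity": {"plagiarism_detection"},
}


def semantic_perimeter(concepts, ontology, radius=1):
    """Return the seed concepts plus every concept within `radius` links.

    Assumed definition: a breadth-first expansion over ontology links.
    Seed concepts that are absent from the ontology are ignored.
    """
    seen = {c for c in concepts if c in ontology}
    frontier = deque((c, 0) for c in seen)
    while frontier:
        node, depth = frontier.popleft()
        if depth == radius:
            continue  # do not expand beyond the chosen radius
        for neighbour in ontology.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen


def perimeter_similarity(concepts_a, concepts_b, ontology, radius=1):
    """Jaccard overlap of two perimeters (an assumed matching score)."""
    perimeter_a = semantic_perimeter(concepts_a, ontology, radius)
    perimeter_b = semantic_perimeter(concepts_b, ontology, radius)
    union = perimeter_a | perimeter_b
    return len(perimeter_a & perimeter_b) / len(union) if union else 0.0


# Two abstracts reduced to hypothetical concept sets with no shared concept.
abstract_a = {"ontology", "conceptual_graph"}
abstract_b = {"graph_matching", "taxonomy"}

score_without_enrichment = perimeter_similarity(abstract_a, abstract_b, TOY_ONTOLOGY, radius=0)
score_with_enrichment = perimeter_similarity(abstract_a, abstract_b, TOY_ONTOLOGY, radius=1)
```

In this toy setting the two concept sets share nothing before enrichment, yet their perimeters overlap once neighbouring ontology concepts are added, which is the intuition behind comparing enriched graphs rather than raw terms.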
