Terminological paraphrase extraction from scientific literature based on predicate argument tuples

Terminological paraphrases (TPs) are sentences or phrases that express terminological concepts in a different form. We propose a novel method for identifying and extracting TPs from large-scale scientific literature databases. The method retrieves sentences that contain a given terminological concept by using semantic units called predicate-argument tuples. It supports effective textual similarity computation and reduces errors through six TP ranking models. For evaluation, we constructed a collection for the TP recognition task by extracting TPs from a target literature database with the proposed method. Through two experiments, we found that scientific literature contains many TPs that had not previously been identified. The experimental results also show the potential and extensibility of the proposed methods for extracting TPs.
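
To make the idea of comparing texts through predicate-argument tuples concrete, here is a minimal sketch. It is not the paper's method: the abstract does not specify the tuple format, the similarity measure, or any of the six ranking models. The sketch assumes each tuple is a hand-written (predicate, argument, argument) triple and substitutes plain Jaccard overlap as the similarity score; the names `PATuple`, `jaccard`, and `rank_candidates` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a predicate-argument tuple is (predicate, first argument, second argument).
# In practice these would come from a parser; here they are written by hand.
PATuple = Tuple[str, str, str]


def jaccard(a: FrozenSet[PATuple], b: FrozenSet[PATuple]) -> float:
    """Jaccard overlap of two tuple sets (a stand-in similarity, not the paper's)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(
    concept: FrozenSet[PATuple],
    candidates: Dict[str, FrozenSet[PATuple]],
) -> List[Tuple[float, str]]:
    """Score each candidate sentence against the concept, highest score first."""
    scored = [(jaccard(concept, tuples), sid) for sid, tuples in candidates.items()]
    # Sort by descending score, breaking ties by sentence id for determinism.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical tuples for a terminological concept and three candidate sentences.
concept = frozenset({
    ("convert", "enzyme", "substrate"),
    ("bind", "enzyme", "substrate"),
})
candidates = {
    "s1": frozenset({("convert", "enzyme", "substrate")}),
    "s2": frozenset({("measure", "author", "temperature")}),
    "s3": frozenset({
        ("convert", "enzyme", "substrate"),
        ("bind", "enzyme", "substrate"),
    }),
}

ranked = rank_candidates(concept, candidates)
print(ranked)
```

Sentences whose tuple sets overlap most with the concept rank first, so a retrieval step could keep only the top-scoring candidates as possible TPs. A real system would need a parser to produce the tuples and a similarity measure that tolerates lexical variation, neither of which is shown here.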
