Replicating Relevance-Ranked Synonym Discovery in a New Language and Domain

Domain-specific synonyms occur in many specialized search tasks, such as searching medical documents, legal documents, and software engineering artifacts. We replicate prior work on ranking domain-specific synonyms in the consumer health domain by applying the approach to a new language and domain: identifying Swedish-language synonyms in the building construction domain. We chose this setting because identifying synonyms in this domain helps downstream systems in which different users may query for documents (e.g., engineering requirements) using different terminology. We consider two new features inspired by the change in language and by methodological advances since the prior work’s publication. An evaluation using data from the building construction domain supports the prior work’s finding that synonym discovery is best approached as a learning-to-rank task in which a human editor views ranked synonym candidates in order to construct a domain-specific thesaurus. We additionally find that FastText embeddings alone provide a strong baseline, though they do not perform as well as the strongest learning-to-rank method. Finally, we analyze the performance of individual features and the differences between the two domains.
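
As a rough illustration of the embedding-only baseline mentioned above, the following sketch ranks synonym candidates for a query term by cosine similarity between word vectors. It assumes cosine similarity as the scoring function, which the abstract does not specify. The function names (`cosine`, `rank_candidates`), the three-dimensional vectors, and the example Swedish terms are all hypothetical placeholders; in practice the vectors would come from a FastText model trained on domain text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, vectors, top_k=5):
    """Return up to top_k (term, score) pairs, most similar to the query first."""
    query_vec = vectors[query]
    scored = [
        (term, cosine(query_vec, vec))
        for term, vec in vectors.items()
        if term != query  # a term is not a candidate synonym of itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up toy vectors, for illustration only (not real FastText output).
toy_vectors = {
    "balk": [0.9, 0.1, 0.0],
    "bjälke": [0.85, 0.15, 0.05],
    "fönster": [0.0, 0.2, 0.9],
    "dörr": [0.1, 0.3, 0.8],
}

ranking = rank_candidates("balk", toy_vectors, top_k=3)
for term, score in ranking:
    print(f"{term}\t{score:.3f}")
```

A human editor would review such a ranked list from the top down. A learning-to-rank model would instead combine a similarity score like this one with other features to produce the ordering.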
