Extending Thesauri Using Word Embeddings and the Intersection Method

In many legal domains, the amount of available and relevant literature is continuously growing. Legal content providers face the challenge of returning relevant and comprehensive content for their customers' search queries on large corpora. However, documents written in natural language contain many synonyms and semantically related concepts. Legal content providers therefore usually maintain thesauri so that their search engines can discover more relevant documents. Maintaining a high-quality thesaurus is an expensive, difficult, and manual task. Word embeddings have recently gained a lot of attention as a means of building thesauri from large corpora. We report our experiences on the feasibility of extending thesauri based on a large corpus of German tax law, with a focus on synonym relations. Using a simple yet powerful new approach, called the intersection method, we can significantly improve and facilitate the extension of thesauri.
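The abstract names the intersection method but does not describe how it works. As an illustration only, the sketch below assumes one plausible reading: several word embedding models are trained on the same corpus, the nearest neighbors of a term are retrieved from each model, and only the words that appear in every neighbor list are kept as synonym candidates. The function names, the toy vectors, and the parameter `k` are hypothetical and are not taken from the paper.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(model, word, k):
    """Return the k words most similar to `word` in one embedding model.

    `model` is assumed to be a plain dict mapping word -> vector.
    """
    target = model[word]
    scored = [(cosine(target, vec), other)
              for other, vec in model.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

def intersect_candidates(models, word, k):
    """Keep only the neighbors that every model agrees on (assumed reading)."""
    neighbor_sets = [set(nearest_neighbors(m, word, k))
                     for m in models if word in m]
    if not neighbor_sets:
        return set()
    return set.intersection(*neighbor_sets)

# Toy, hand-made vectors standing in for two separately trained models.
model_a = {"steuer": (1.0, 0.0), "abgabe": (0.9, 0.1),
           "gebühr": (0.8, 0.3), "haus": (0.0, 1.0)}
model_b = {"steuer": (0.0, 1.0), "abgabe": (0.1, 0.9),
           "haus": (0.2, 0.8), "gebühr": (1.0, 0.0)}

candidates = intersect_candidates([model_a, model_b], "steuer", k=2)
print(candidates)
```

In this toy setup each model proposes two neighbors for "steuer", but only the word both models agree on survives the intersection. Under this reading, the intersection acts as a filter that shortens the list of candidates a thesaurus editor has to review.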
