A Hierarchical Playscript Representation of Distributed Words for Effective Semantic Clustering and Search

Semantic word embeddings have been shown to cluster in vector space according to linguistic similarities that are quantifiably captured with simple vector algebra. Recently, methods for learning distributed word vectors have progressively empowered neural language models to compute compositional vector representations for phrases of variable length. However, they remain limited in expressing more generic relatedness between instances of a larger, non-uniformly sized body of text. A recent study proposed a formulation that combines a word-vector set of variable cardinality to represent a verse with an iterative distance metric that evaluates similarity between pairs of non-conforming verse matrices. In this work, we expand on this sentence abstraction and apply it to a dialogue passage that is prescribed in a playscript and uttered by an actor. In contrast to baselines characterized by a bag of features, our model preserves word order and scales more sustainably to semantic matching at any of the dialogue, act, and play levels. To validate our framework for training word vectors, we analyzed the clustering of the complete set of Shakespeare's plays using multidimensional scaling for visualization, and experimented with playscript searches over both contiguous and out-of-order parts of dialogues. We report robust results that support our intuition for measuring play-to-play and dialogue-to-play similarity.
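The abstract's central device is comparing two dialogues represented as word-vector matrices of differing row counts. The paper's exact iterative metric is not given here, so the following is only a minimal illustrative sketch, assuming a symmetric greedy cosine matching between rows of the two matrices; the function name `verse_similarity` and the toy embeddings are hypothetical.

```python
import numpy as np

def verse_similarity(A, B):
    """Illustrative similarity between two verse matrices whose row
    counts (word cardinalities) differ; rows are word vectors."""
    # Normalize rows to unit length so dot products are cosines.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    S = A @ B.T  # pairwise cosine similarities, shape (|A|, |B|)
    # Match each word to its closest counterpart in the other verse,
    # in both directions, and average to make the measure symmetric.
    return 0.5 * (S.max(axis=1).mean() + S.max(axis=0).mean())

# Toy example: a 3-word and a 5-word "dialogue" in a 4-d embedding space.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(5, 4))
print(verse_similarity(A, B))
print(verse_similarity(A, A))  # identical matrices score 1.0
```

A hierarchical score in the spirit of the paper could then aggregate such verse-level similarities upward over dialogues, acts, and plays.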
