Generalised Brown Clustering and Roll-Up Feature Generation

Brown clustering is an established technique, used in hundreds of computational linguistics papers each year, to group word types that have similar distributional information. It is unsupervised and can be used to create powerful word representations for machine learning. Despite its improbable success relative to more complex methods, few have investigated whether Brown clustering has really been applied optimally. In this paper, we present a subtle but profound generalisation of Brown clustering that improves overall quality by decoupling the number of output classes from the computational active set size. Moreover, the generalisation permits a novel approach to feature selection from Brown clusters: we show that the standard approach of shearing the Brown clustering output tree at arbitrary bitlengths is lossy, and that features should instead be chosen by rolling up Generalised Brown hierarchies. The generalisation and corresponding feature generation are more principled, challenging the way Brown clustering is currently understood and applied.
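
As a loose, hypothetical illustration of the contrast the abstract draws (not the paper's actual algorithm), the sketch below compares the standard practice of shearing Brown cluster bit-string paths at fixed prefix lengths with selecting features by walking a word's ancestors in the order the hierarchy merged them. The toy paths, merge order, and function names are invented for the example.

```python
# Toy illustration: bit-prefix "shearing" vs. hierarchy "roll-up".
# The cluster paths and merge order below are invented, not real Brown output.

# Brown clustering assigns each word a bit-string path in a binary merge tree.
paths = {
    "monday":  "00110",
    "tuesday": "00111",
    "london":  "0100",
    "paris":   "0101",
    "run":     "110",
}

def prefix_features(word, lengths=(2, 4, 6)):
    """Standard practice: shear the tree at fixed bit lengths.

    A word whose full path is shorter than a requested length simply
    repeats its full path, and the cut points need not correspond to
    meaningful merges in the hierarchy.
    """
    path = paths[word]
    return [path[:n] for n in lengths]

def rollup_features(word, merge_order):
    """Roll-up alternative: emit one feature per merge the word actually
    participated in, following the order in which the hierarchy merged
    its ancestors."""
    path = paths[word]
    ancestors = [path[:i] for i in range(len(path), 0, -1)]
    return [a for a in merge_order if a in ancestors]

# Invented merge order (most specific merges first).
merge_order = ["0011", "010", "001", "00", "11", "01", "0", "1"]

print(prefix_features("monday"))               # ['00', '0011', '00110']
print(rollup_features("monday", merge_order))  # ['0011', '001', '00', '0']
```

The point of the sketch is only that prefix shearing discards where merges actually occurred, whereas features read off the merge hierarchy itself respect its structure.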
