Superior and Efficient Fully Unsupervised Pattern-based Concept Acquisition Using an Unsupervised Parser

Sets of lexical items sharing a significant aspect of their meaning (concepts) are fundamental to linguistics and NLP. Unsupervised concept acquisition algorithms have been shown to produce good results, and are preferable to manual preparation of concept resources, which is labor-intensive, error-prone and somewhat arbitrary. Some existing concept mining methods rely on supervised language-specific modules such as POS taggers and computationally intensive parsers. In this paper we present an efficient, fully unsupervised concept acquisition algorithm that uses syntactic information obtained from a fully unsupervised parser. Our algorithm incorporates the bracketings induced by the parser into the meta-patterns used by a concept discovery algorithm based on symmetric patterns and graphs. We evaluate our algorithm on very large corpora in English and Russian, using both human judgments and WordNet-based evaluation. Using settings similar to those of the leading previous fully unsupervised work, we show a significant improvement in concept quality and in the extraction of multiword expressions. Our method is the first to use fully unsupervised parsing for unsupervised concept discovery, and requires no language-specific tools or pattern/word seeds.
