An Unsupervised Model for Instance Level Subcategorization Acquisition

Most existing systems for subcategorization frame (SCF) acquisition rely on supervised parsing and infer SCF distributions at the type level rather than the instance level. These systems port poorly across domains, and their benefit for NLP tasks that involve sentence-level processing is limited. We propose a new unsupervised, Markov Random Field-based model for SCF acquisition that is designed to address these problems. The system relies on supervised POS tagging rather than parsing, and is capable of learning SCFs at the instance level. Evaluation against gold standard data shows that our system outperforms several supervised and type-level SCF baselines. We also conduct a task-based evaluation in the context of verb similarity prediction, demonstrating that a vector space model based on our SCFs substantially outperforms a lexical model and a model based on a supervised parser.
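As a rough illustration of the task-based evaluation idea, the sketch below shows one way instance-level SCF labels could be aggregated into per-verb vectors and compared with cosine similarity. This is not the authors' model or feature set: the frame labels (`NP-NP`, `NP-PP`, `INTRANS`), the toy instances, and the helper names `scf_vectors` and `cosine` are all hypothetical, chosen only to make the vector space construction concrete.

```python
from collections import Counter
from math import sqrt


def scf_vectors(instances):
    """Aggregate (verb, frame) instance labels into per-verb frame counts."""
    vectors = {}
    for verb, frame in instances:
        vectors.setdefault(verb, Counter())[frame] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: one (verb, SCF label) pair per verb instance.
instances = [
    ("give", "NP-NP"), ("give", "NP-PP"), ("give", "NP-NP"),
    ("send", "NP-NP"), ("send", "NP-PP"),
    ("sleep", "INTRANS"),
]

vectors = scf_vectors(instances)
sim_give_send = cosine(vectors["give"], vectors["send"])
sim_give_sleep = cosine(vectors["give"], vectors["sleep"])
```

In this toy setting, verbs that share frames ("give", "send") score higher than verbs with disjoint frames ("give", "sleep"); a real evaluation would compare such scores against human similarity judgments.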
