The Effect of Corpus Size on Case Frame Acquisition for Predicate-Argument Structure Analysis

This paper reports the effect of corpus size on case frame acquisition for predicate-argument structure analysis in Japanese. For this study, we collected a Japanese corpus of up to 100 billion words and constructed case frames from corpora of six different sizes. We then applied these case frames to syntactic and case structure analysis and to zero anaphora resolution, in order to investigate the relationship between the corpus size used for case frame acquisition and the performance of predicate-argument structure analysis. Case frames constructed from larger corpora yielded better analyses, and performance had not saturated even at a corpus size of 100 billion words.
