A Web Corpus and Word Sketches for Japanese

Among the major world languages, Japanese lags behind in publicly accessible, searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processing, ‘word sketching’ (one-page summaries of a word’s grammatical and collocational behaviour), a distributional thesaurus, and robot use. We describe the steps taken to gather and process the corpus and to establish its validity in terms of the kinds of language it contains. We then describe the development of a shallow grammar for Japanese to enable word sketching. We believe that the Japanese web corpus, as loaded into the Sketch Engine, will be a useful resource for a wide range of researchers, learners, and NLP developers working with Japanese.
