Paper Information - A Closer Look at Skip-gram Modelling

A Closer Look at Skip-gram Modelling

Data sparsity is a major problem in natural language processing: language is a system of rare events, so varied and complex that even an extremely large corpus can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique whereby n-grams are still stored to model language, but tokens are allowed to be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams these cover in test documents. We examine skip-gram modelling using one to four skips with various amounts of training data, and test against similar documents as well as documents generated from a machine translation system. We also determine the amount of extra training data required to achieve skip-gram coverage using standard adjacent tri-grams.
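The procedure the abstract describes can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: a k-skip-n-gram is taken to be an ordered n-word subsequence that skips at most k tokens in total, text is split on whitespace, and coverage is the fraction of adjacent test n-grams found among the training skip-grams. The names `skip_grams` and `coverage` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """Return the set of n-grams over `tokens` that skip at most k tokens in total.

    Assumption: the skip budget k is shared across the whole n-gram,
    so k=0 yields the ordinary adjacent n-grams.
    """
    grams = set()
    for start in range(len(tokens) - n + 1):
        # The other n-1 words may come from the next n-1+k positions.
        window = range(start + 1, min(start + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.add((tokens[start],) + tuple(tokens[i] for i in rest))
    return grams


def coverage(train_tokens, test_tokens, n, k):
    """Fraction of adjacent n-grams in the test text seen as k-skip-n-grams in training."""
    seen = skip_grams(train_tokens, n, k)
    test = [tuple(test_tokens[i:i + n]) for i in range(len(test_tokens) - n + 1)]
    if not test:
        return 0.0
    return sum(gram in seen for gram in test) / len(test)


if __name__ == "__main__":
    train = "insurgents killed in ongoing fighting".split()
    test = "insurgents killed ongoing fighting".split()
    for k in range(0, 3):
        print(k, len(skip_grams(train, 3, k)), coverage(train, test, 3, k))
```

In this toy case the test sentence shares no adjacent tri-gram with the training sentence, yet both of its tri-grams appear among the training 2-skip-tri-grams. This is the kind of gain in coverage the paper sets out to measure at corpus scale.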
