Use of Massive Amounts of Web Text in Japanese Named Entity Recognition

In this paper, we propose a method for efficiently extracting information from large corpora, such as text on the Web, and using it in named entity recognition. The method extracts knowledge from the corpora in the form of an entity list and incorporates that knowledge into the feature design of existing supervised models. We conducted experiments with a Web corpus of as many as 5 billion sentences and confirmed that the proposed method improves recall on unknown named entities, that is, entities that do not appear in the training data.
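As a rough illustration of how an entity list can feed the feature design of a supervised tagger, the sketch below marks each token with a BIO-style feature whenever a span of tokens matches a list entry. The abstract does not specify the encoding, so the function name `entity_list_features`, the longest-match strategy, the `max_len` cutoff, and the toy entity list are all assumptions made for illustration, not the method evaluated in this paper.

```python
from typing import Dict, List, Tuple


def entity_list_features(
    tokens: List[str],
    entity_list: Dict[Tuple[str, ...], str],
    max_len: int = 5,
) -> List[str]:
    """Return one feature string per token using greedy longest match.

    Hypothetical helper: `entity_list` maps a token sequence to an entity
    type, as might be extracted from a large Web corpus.
    """
    feats = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = entity_list.get(tuple(tokens[i:i + n]))
            if label is not None:
                feats[i] = "B-" + label
                for j in range(i + 1, i + n):
                    feats[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return feats


# Toy entity list (illustrative only, not data from the paper).
toy_list = {("東京", "大学"): "ORG", ("東京",): "LOC"}
toy_feats = entity_list_features(["東京", "大学", "に", "行く"], toy_list)
```

In a setup like this, the resulting strings would be added alongside the usual surface and part-of-speech features of the supervised model, so that an entity unseen in the training data can still be signaled by its presence in the list.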