Learning Chinese Entity Attributes from Online Encyclopedia

Automatically constructing knowledge bases from free online encyclopedias is considered a crucial step in many Internet-related areas. However, current research focuses mainly on extracting knowledge facts from English resources, and there is less work on other languages. In this paper, we describe an approach to extracting entity attributes from HudongBaike, a free Chinese online encyclopedia. We first identified attribute-value pairs from HudongBaike pages that feature InfoBoxes; these pairs in turn indicate which attributes we should look for in different HudongBaike entries. We then adopted a keyword matching approach to identify candidate sentences for each attribute in a plain HudongBaike article. Finally, we trained a CRF model to extract the corresponding values from these candidate sentences. Our approach is simple but effective, and our experiments show that it is possible to produce a large number of triples from free online encyclopedias, which can then be used to construct Chinese knowledge bases with less human supervision.
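
As a rough illustration of the keyword matching step, the following minimal Python sketch selects candidate sentences for each attribute. It is not the implementation used in this work: the keyword lists, the function names (`split_sentences`, `candidate_sentences`), and the example sentence are assumptions made for illustration only. In the approach described above, the attributes would come from InfoBox attribute-value pairs, and the selected sentences would then be passed to the CRF model.

```python
import re

# Hypothetical attribute -> keyword lists (an assumption for illustration).
# In the described approach, attributes are learned from InfoBox pairs.
ATTRIBUTE_KEYWORDS = {
    "出生地": ["出生于", "生于"],   # birthplace
    "毕业院校": ["毕业于"],         # alma mater
}


def split_sentences(text):
    """Split Chinese text into sentences, keeping the end punctuation."""
    pieces = re.findall(r"[^。！？]+[。！？]?", text)
    return [p.strip() for p in pieces if p.strip()]


def candidate_sentences(text, attribute_keywords):
    """Map each attribute to the sentences containing any of its keywords."""
    sentences = split_sentences(text)
    candidates = {}
    for attribute, keywords in attribute_keywords.items():
        candidates[attribute] = [
            s for s in sentences if any(k in s for k in keywords)
        ]
    return candidates


# Made-up example article text, not taken from HudongBaike.
article = "鲁迅1881年生于浙江绍兴。他曾在日本留学。"
result = candidate_sentences(article, ATTRIBUTE_KEYWORDS)
```

Here `result["出生地"]` holds the first sentence only, because it contains the keyword "生于", while `result["毕业院校"]` is empty. Plain substring matching is the simplest possible choice; it will over-select sentences, which is acceptable when a later sequence labeling step decides whether a value is actually present.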