An Information Extraction Method from Different Structural Web Sites by Word Distances between a User Instantiated Label and Similar Entity

This paper addresses information extraction from structurally different web sites using a user-instantiated example. A user-instantiated example consists of labels, which are the criteria for deciding whether to purchase a target product or service, and instances related to those labels. When the extraction method outputs information in table form, the labels serve as the column headings and the instances as the values that fill the table. Because a web site contains many labels and much information that does not correspond to the target, extracting the information related to the target is difficult. Information about the target tends to be written as a string similar to the instances; such a string is called a "similar entity". Target information also tends to be written close to the labels. The proposed method therefore extracts information using the number of words between a user-instantiated label and similar entities. In addition, to extract information that is described across multiple web sites, the proposed method also extracts information from linked web sites that are similar to the web site used for the user-instantiated example. Experimental results show that the proposed method extracts information with a recall of 65% and a precision of 91%.
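
As an illustration of the core idea only, the following minimal Python sketch picks, for a given label, the similar entity that lies the fewest words away from it. It is not the authors' implementation. The names shape and extract_by_word_distance are hypothetical, the "similar string" test is assumed here to be equality of a coarse character-class pattern (the abstract does not specify the similarity measure), and labels are assumed to be single words.

```python
import re


def shape(token: str) -> str:
    """Collapse a token to a coarse character-class pattern, e.g. "$120" -> "$9".

    Assumption: two strings count as "similar" when their patterns are equal.
    """
    collapsed = re.sub(r"[0-9]+", "9", token)       # any run of digits -> "9"
    collapsed = re.sub(r"[A-Za-z]+", "a", collapsed)  # any run of letters -> "a"
    return collapsed


def extract_by_word_distance(text: str, label: str, instance: str):
    """Return the similar entity closest (in words) to the label, or None."""
    tokens = text.split()
    # Positions where the user-instantiated label occurs (a trailing colon is ignored).
    label_positions = [
        i for i, tok in enumerate(tokens) if tok.strip(":").lower() == label.lower()
    ]
    # Positions of similar entities: tokens shaped like the user's instance.
    target_shape = shape(instance)
    candidate_positions = [
        i for i, tok in enumerate(tokens) if shape(tok) == target_shape
    ]
    # Choose the (distance, position) pair with the smallest word distance.
    best = min(
        (
            (abs(c - l), c)
            for l in label_positions
            for c in candidate_positions
            if c != l
        ),
        default=None,
    )
    return None if best is None else tokens[best[1]]


page = "Hotel Sakura Rooms: 40 Price: $95 per night Phone: 555-0100"
print(extract_by_word_distance(page, "Price", "$120"))
print(extract_by_word_distance(page, "Rooms", "25"))
```

Here the user's example on one site pairs the label "Price" with the instance "$120"; on another page the token "$95" has the same pattern and is one word away from "Price:", so it is selected. A fuller treatment would need multi-word labels and entities and a graded similarity score, which this sketch omits.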