A novel method is proposed for discovering relations among entities that are buried in different nested structures of XML documents. Given the entity types specified by a user, the method identifies the relations among them and extracts both the relation instances and the patterns in which they occur in the documents. It proceeds in three steps. First, it identifies and collects the XML fragments that contain every entity type specified by the user. Second, it computes the similarity between fragments from the semantics of their tags and from their structures, and clusters the fragments with an adaptively selected similarity threshold, so that fragments containing the same relation are clustered together. Finally, it extracts the relation instances and their occurrence patterns from each cluster. The experimental results show that the method correctly identifies and extracts relation information among the given entity types from all kinds of XML documents whose tags are meaningful.
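The three steps can be illustrated with a minimal Python sketch. It is not the method evaluated in the paper, and it rests on several assumptions: entity types are taken to be tag names, fragment similarity is replaced by a plain Jaccard overlap of tag paths (structure only, with no tag semantics), and the threshold is a fixed parameter rather than an adaptively selected one. All function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def minimal_fragments(root, entity_tags):
    """Collect the smallest subtrees that contain every requested tag."""
    wanted = set(entity_tags)
    found = []

    def visit(elem):
        present = {elem.tag} & wanted   # requested tags seen in this subtree
        child_hit = False               # does a descendant already qualify?
        for child in elem:
            sub, hit = visit(child)
            present |= sub
            child_hit = child_hit or hit
        complete = wanted <= present
        if complete and not child_hit:
            found.append(elem)
        return present, complete or child_hit

    visit(root)
    return found

def tag_paths(elem, prefix=""):
    """Set of root-to-node tag paths inside one fragment."""
    path = f"{prefix}/{elem.tag}"
    paths = {path}
    for child in elem:
        paths |= tag_paths(child, path)
    return paths

def similarity(a, b):
    """Assumed stand-in measure: Jaccard overlap of tag paths."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)

def cluster(fragments, threshold):
    """Greedy single-link grouping with a fixed (not adaptive) threshold."""
    groups = []
    for frag in fragments:
        for group in groups:
            if any(similarity(frag, other) >= threshold for other in group):
                group.append(frag)
                break
        else:
            groups.append([frag])
    return groups

def instance(frag, entity_tags):
    """One relation instance: text of the first descendant with each tag."""
    return tuple(frag.findtext(f".//{tag}") for tag in entity_tags)

# Hypothetical sample document with two different nestings of title/author.
doc = ET.fromstring(
    "<library>"
    "<book><title>T1</title><author>A1</author></book>"
    "<book><title>T2</title><author>A2</author></book>"
    "<review><item><title>T1</title></item>"
    "<critic><author>C1</author></critic></review>"
    "</library>"
)
entity_tags = ["title", "author"]
frags = minimal_fragments(doc, entity_tags)
groups = cluster(frags, threshold=0.5)
for group in groups:
    pattern = sorted(tag_paths(group[0]))   # occurrence pattern of the cluster
    print(pattern, [instance(f, entity_tags) for f in group])
```

In this sketch the two `book` fragments share all their tag paths and fall into one cluster, while the `review` fragment, which nests the same entity types differently, forms a cluster of its own. A faithful implementation would additionally compare tags by meaning and choose the threshold from the data.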