Finding similar identities among objects from multiple web sources

When integrating data from multiple Web sources, the same object can appear in different formats and structures, making it difficult to identify which objects should be matched. In this paper, we propose an approach to finding similar identities among objects from multiple Web sources. In this approach, object identification works like the relational join operation, with a similarity function taking the place of the equality condition. This similarity function is based on information retrieval techniques. Our approach differs from others in the literature in that it can identify objects with complex structure (e.g., XML documents), not only objects with a flat structure such as relations. The effectiveness of our approach is demonstrated by experimental results on real Web data sources from different domains, which reach precision levels above 75%.
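To make the join analogy concrete, the following is a minimal sketch of a similarity join over nested objects. It is not the method evaluated in the paper: the abstract says only that the similarity function is based on information retrieval techniques, so the TF-IDF weighting, the cosine measure, the 0.5 threshold, the flattening of nested values into a bag of words, and all function names and sample records here are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def flatten(obj):
    """Concatenate all leaf values of a nested dict/list object.

    Assumption: element names (keys) are ignored, so two sources may
    use different structures for the same content.
    """
    if isinstance(obj, dict):
        return " ".join(flatten(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return " ".join(flatten(v) for v in obj)
    return str(obj)


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (term -> weight) for a list of texts."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF, so terms present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_join(left, right, threshold=0.5):
    """Join two object collections, keeping pairs with similarity >= threshold.

    This plays the role of a relational join whose equality condition
    is replaced by a similarity test.
    """
    vectors = tfidf_vectors([flatten(o) for o in left + right])
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    pairs = []
    for i, u in enumerate(left_vecs):
        for j, v in enumerate(right_vecs):
            score = cosine(u, v)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Hypothetical records from two sources with different structures.
source_a = [
    {"title": "Data Integration on the Web", "author": {"name": "A. Silva"}},
    {"title": "Query Optimization", "author": {"name": "B. Jones"}},
]
source_b = [
    {"paper": {"name": "Data integration on the Web: a survey",
               "writers": ["Silva, A."]}},
    {"paper": {"name": "Image compression", "writers": ["C. Lee"]}},
]

matches = similarity_join(source_a, source_b)
```

With these sample records, only the first object of each source shares any terms, so it is the only pair the join returns. A nested-loop join like this compares every pair and is meant only to illustrate the idea; a practical system would likely use an index to avoid the quadratic comparison.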