In this paper we develop a disambiguation algorithm and study its impact on People Search. The proposed algorithm first applies extraction techniques to automatically identify "significant" entities on each Web page, such as the names of other persons, organizations, and locations. It also extracts and parses HTML and Web-related data on each page, such as hyperlinks and email addresses. The algorithm then views all of this information in a unified way: as an entity-relationship (ER) graph in which entities (e.g., people, organizations, locations, Web pages) are interconnected via relationships (e.g., "Web page-mentions-person", relationships derived from hyperlinks, etc.). The algorithm gains its power from its ability to analyze several types of information: attributes associated with the entities (e.g., TF/IDF for Web pages) and, most importantly, the direct and indirect interconnections that exist among entities in the ER graph. We outline our approach in Section 2 and compare it with state-of-the-art solutions in Section 3.
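To make the unified graph view concrete, the following is a minimal sketch and not the implementation described in this paper. It assumes the entities, hyperlinks, and email addresses have already been extracted for each page. The function names (`build_er_graph`, `path_length`), the input layout, and the sample pages are hypothetical, and shortest-path length is used here only as a simple stand-in for the analysis of direct and indirect interconnections.

```python
from collections import defaultdict, deque


def build_er_graph(pages):
    """Build an undirected ER graph from pre-extracted page data.

    `pages` maps a page id to a dict with optional keys "person", "org",
    "location", "email" (lists of extracted strings) and "links"
    (list of target page ids). This layout is an assumption of the sketch.
    Nodes are (kind, value) tuples, e.g. ("page", "p1") or ("org", "UCI").
    """
    graph = defaultdict(set)

    def add_edge(u, v):
        graph[u].add(v)
        graph[v].add(u)

    for page_id, info in pages.items():
        page = ("page", page_id)
        graph[page]  # make sure isolated pages still appear as nodes
        # "Web page-mentions-entity" relationships
        for kind in ("person", "org", "location", "email"):
            for value in info.get(kind, []):
                add_edge(page, (kind, value))
        # relationships derived from hyperlinks
        for target in info.get("links", []):
            add_edge(page, ("page", target))
    return graph


def path_length(graph, src, dst):
    """Return the number of edges on a shortest path, or None if unconnected."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical pages returned for one ambiguous name query.
pages = {
    "p1": {"person": ["B. Jones"], "org": ["UCI"]},
    "p2": {"org": ["UCI"], "email": ["someone@example.org"]},
    "p3": {"location": ["Boston"], "links": ["p4"]},
}
graph = build_er_graph(pages)
print(path_length(graph, ("page", "p1"), ("page", "p2")))  # connected via ("org", "UCI")
print(path_length(graph, ("page", "p1"), ("page", "p3")))  # no connecting path
```

In this toy input, pages `p1` and `p2` are indirectly connected through a shared organization node, which is the kind of evidence that could suggest they refer to the same person, while `p3` shares no connection with either.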