As the Web grows rapidly, more and more information about entities appears on it, including their profile information and web logs recording their ideas, activities, speech and so on. However, many entities, such as persons and locations, share the same name. This paper presents an approach to estimating the number of entities that share a name by employing many-to-one features. The basic idea is that entities are unlikely to share all of their other features even if they have the same name. We list some strategies for selecting key features, present an approach to extracting those features from the Web, and combine them to estimate the number of entities. In addition, we give a method to identify and filter out fake information that would otherwise confuse the estimate.
The referents of a name appearing on the Web are difficult to distinguish because of the lack of features that can be used to identify them. Originally, a name was a key feature used to distinguish one entity from the others: the relation between names and entities should be many-to-one to prevent misunderstanding. However, as more and more entities appear on the Web, this relation has broken down and become many-to-many. One name may potentially refer to tens or hundreds of entities, so a name alone cannot identify an entity on the Web because homonyms are so common.
Our purpose is to estimate a lower bound on the number of referents of each name in a name list, so we skip the name recognition step of the general name disambiguation process. We present a method based on choosing special features that have many-to-one relations, including profiles of the entities and relations between entities. For a person name, for example, the person's birthday and the names of his or her parents may be important features. Using these features, combined with the person's name, we are able to distinguish this person from other homonyms. How to choose and extract features should therefore be considered first.
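The lower-bound idea can be illustrated with a minimal Python sketch. It is not the paper's actual estimator: the function name `estimate_lower_bound`, the `(feature, value)` input format, the lowercase normalization and the `min_support` threshold (a crude stand-in for filtering fake information) are all assumptions made for illustration. The sketch relies only on the property stated above: if each entity has exactly one value for a many-to-one feature such as a birthday, then the number of distinct values observed for one name cannot exceed the number of entities bearing that name.

```python
from collections import Counter, defaultdict


def estimate_lower_bound(observations, min_support=1):
    """Estimate a lower bound on the number of entities sharing one name.

    observations: iterable of (feature, value) pairs extracted for that name,
    e.g. ("birthday", "1970-01-02"). Each feature is assumed to be many-to-one.
    min_support: hypothetical threshold; values seen fewer times are ignored
    as possibly fake or noisy.
    """
    values_by_feature = defaultdict(Counter)
    for feature, value in observations:
        # Assumed normalization: trim whitespace and ignore case.
        values_by_feature[feature][value.strip().lower()] += 1

    bound = 0
    for counter in values_by_feature.values():
        # Count distinct values with enough supporting observations.
        supported = sum(1 for n in counter.values() if n >= min_support)
        # Every feature gives its own lower bound; keep the tightest one.
        bound = max(bound, supported)
    return bound


observations = [
    ("birthday", "1970-01-02"),
    ("birthday", "1985-06-30"),
    ("birthday", "1970-01-02"),
    ("father", "John Smith"),
]
print(estimate_lower_bound(observations))
```

Taking the maximum over features is the simplest way to combine them; a fuller treatment would also reason about which observations can belong to the same entity.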
Data on the Web is often redundant and rich in recurring feature patterns. We use iterative pattern relation extraction to make our method scalable. Because many web pages contain only the entity name and none of the features we want, we also use query expansion to avoid data sparsity.
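Iterative pattern relation extraction can be sketched as a small bootstrapping loop: known (name, value) pairs yield textual patterns, and those patterns yield new pairs. The Python sketch below is a simplified illustration and not the implementation used in this work. It assumes the feature is a birthday written as `YYYY-MM-DD`, that names are sequences of capitalized words, and that a pattern is just the text between a name and its value; the function names and the `max_len` limit are hypothetical.

```python
import re

# Assumed surface forms for illustration only.
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)+)"
DATE = r"(\d{4}-\d{2}-\d{2})"


def learn_patterns(sentences, pairs, max_len=30):
    """Collect the text found between a known name and its known value."""
    patterns = set()
    for sentence in sentences:
        for name, value in pairs:
            start = sentence.find(name)
            if start < 0:
                continue
            end = sentence.find(value, start + len(name))
            if end < 0:
                continue
            middle = sentence[start + len(name):end]
            if 0 < len(middle) <= max_len:
                patterns.add(middle)
    return patterns


def apply_patterns(sentences, patterns):
    """Use each learned middle context to extract new (name, value) pairs."""
    found = set()
    for middle in patterns:
        regex = re.compile(NAME + re.escape(middle) + DATE)
        for sentence in sentences:
            for match in regex.finditer(sentence):
                found.add((match.group(1), match.group(2)))
    return found


def bootstrap(sentences, seeds, rounds=3):
    """Alternate between learning patterns and extracting pairs."""
    pairs = set(seeds)
    for _ in range(rounds):
        new_pairs = apply_patterns(sentences, learn_patterns(sentences, pairs))
        if new_pairs <= pairs:
            break  # nothing new was found, so stop early
        pairs |= new_pairs
    return pairs
```

Starting from a few seed pairs, `bootstrap` returns the seeds plus every pair matched by a learned pattern. A practical system would also score patterns for specificity to limit noise, which this sketch omits.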