Survey of Author Name Disambiguation: 2004 to 2010

Introduction

As resources are encoded in various metadata schemes, digital libraries grow, and the internet and metadata encoding in general move toward interoperability, name and identity disambiguation pose problems for metadata development. Databases and search features must be able to determine whether the person who wrote article A also wrote article B. Searchers may want to call up all items written or created by a particular person. Researchers may need to determine exactly who wrote an article in order to contact that author to propose future collaboration or ask follow-up questions about the data. Most metadata practices do not easily support name disambiguation, and the problem grows as the number of resources and the varieties of metadata grow.

Smalheiser and Torvik offer an excellent description of the four main challenges that affect name disambiguation. First, the same individual might write under more than one name due to "orthographic and spelling variations," spelling errors, name changes (for marriage, etc.), and the use of pseudonyms or pen names (Smalheiser & Torvik, 2009). Secondly, many different people share the same name. Perreria, et al. describe this second case as one of polysemes, as opposed to the first case, which involves synonyms (2009). Some names, like John Smith, appear again and again, creating the challenge of distinguishing one author from another. Thirdly, according to Smalheiser and Torvik, metadata, especially in article databases or on blogs, may be incomplete; often only the initials of the first and middle names are recorded in article databases rather than the full name. Lastly, many articles are multi-authored and interdisciplinary (2009).
The growing trend of interdisciplinary work makes it more difficult to tell whether the John Smith publishing in linguistics is different from the John Smith publishing in biochemistry, whereas in the past it could safely be assumed that they were two different people.

Han, et al. point out that "[n]ame ambiguity can affect the quality of scientific data gathering, can decrease the performance of information retrieval and web search, and can cause the incorrect identification of and credit attribution to authors" (2004). They give an example from the Digital Bibliography & Library Project (DBLP) in which one author page, which should reflect a single author's work, actually contains citations that belong to three separate people (Han, et al., 2004). With so much at stake in resolving the name ambiguity problem, researchers have been working hard to find a solution, especially during the past few years.

Human or Machine Disambiguation

There is disagreement about whether information science researchers should focus on manual or automatic name disambiguation methods. Smalheiser and Torvik list two reasons why manual disambiguation is not always possible. First, very large digital libraries often harvest their metadata from other sources. Because these libraries mix records from many places, each of which might use its own form of name authority, and because they are so large, it is not practical to create and fix name authorities for them manually (2009). Smalheiser and Torvik's second reason is that internet searches will always look at many more resources than can practically be cataloged by hand (2009). Search engines and other methods for organizing and finding information on the World Wide Web must therefore implement workable automatic name disambiguation methods. In contrast to Smalheiser and Torvik, Veve argues that automated name matching will always fail, so human intervention will always be required (2009).
Veve explains that "the few endeavors that have tried [name authority control in XML], such as the systems for automated generation of authority control, have only been successful in extracting names from XML records but not in turning them into reliable access points" (2009). …