Paper information - Identifying similar objects in social networks and digital libraries

Identifying similar objects in social networks and digital libraries

With the rise of the computer age, many kinds of information can be easily accessed in digital format. However, the objects found within this information, such as people, places, dates, and firms, form tangled and complex relationships that are usually challenging to untangle. In this dissertation, we aim to unravel the relationships among objects at the finest granularity: the similarity level between any pair of objects. Discovering similar objects can be the foundation of several research problems and applications. For example, objects can be clustered into groups by merging similar objects together. This merging process can be performed recursively so that a hierarchical structure of the objects is constructed. In addition, hidden relationships among objects can be inferred by examining similar objects that do not explicitly interact with each other. This dissertation examines the problem of discovering similar objects in two different settings: (1) discovering similar objects based on the interactions among them, and (2) discovering similar objects based on their meta-data. We mainly focus on the first setting. The interactions among objects are modeled by a network structure, in which each node represents one object and an edge is present if the two objects have interacted with each other. In the second setting, we examine the similarity problem where information other than the interaction history is available. Here we target digital library objects, such as papers, authors, and publication venues (i.e., the conference or journal in which a paper is published). The meta-data of these objects could be, for example, the citation count of a paper, the affiliation of an author, or the topics of a conference. These meta-data are used to infer similar objects, such as similar terms, similar venues, or relevant authors given a topic.
To validate our proposed models and methodologies, we conducted experiments on several different data sets to discover the hidden relationships among the target objects. These include (1) the relationships among the authors, papers, and venues in a digital library, (2) the actors, actresses, and movies in a movie data set, and (3) the diseases and genes of patients. In addition, we implemented two live systems based on the CiteSeerX digital library to bring several of these research results into practical products. The first system, CollabSeer, recommends potential collaborators based on a user’s research interests and previous coauthoring behavior. The second one, CSSeer, recommends a list of experts for a given term of interest, based on the similarity score between the query term and the publication and citation history of the authors. Both systems efficiently handle more than one million papers and over 300 thousand disambiguated authors.
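
As an illustration of the first setting, the network model described in the abstract (one node per object, an edge whenever two objects have interacted) can be sketched in a few lines of Python. The abstract does not say which similarity measure the dissertation uses, so the Jaccard overlap of neighbor sets below is an assumed stand-in, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (object, object) interaction pairs."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def jaccard_similarity(neighbors, u, v):
    """Assumed stand-in measure: shared neighbors divided by all neighbors of u or v."""
    nu = neighbors.get(u, set())
    nv = neighbors.get(v, set())
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)


# Hypothetical toy data: alice and dave never interact directly,
# but both interact with bob and carol.
interactions = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("dave", "bob"),
    ("dave", "carol"),
]
network = build_network(interactions)
hidden_pair_score = jaccard_similarity(network, "alice", "dave")
direct_pair_score = jaccard_similarity(network, "alice", "bob")
```

In this toy network the two objects that never interact share all of their neighbors, which is the kind of hidden relationship the abstract refers to; a measure based only on direct edges would miss it.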

Hung-Hsuan Chen | C. Lee Giles